1. EXECUTIVE SUMMARY


This document is the final report from ITTDCD to RADC for Contract Number
F30602-81-C-0155, entitled Limited Connected Speech Experiment (LCSE). The
purpose of this contract was to demonstrate that Connected Speech Recognition
(CSR) can be performed in real time on a vocabulary of one hundred words and to
test the performance of the CSR system for twenty-five male and twenty-five
female speakers. This report describes ITTDCD's real time laboratory CSR system,
the data base and training software developed in accordance with the contract,
and the results of the performance tests.

ITTDCD's real time laboratory system is a flexible speech recognition program
which operates in an FPS AP-120B array processor that is connected to a VAX
11/780 computer. The user can easily define the vocabulary and syntax for a given
recognition task via interactive syntax specification commands. In addition to
performing task specific phrase recognition, the CSR program has a "voice-control"
feature which allows the user to control the system via spoken commands of his own
choosing. A versatile training capability permits the user to adapt to the speaker
dependent system by speaking both words and phrases from a vocabulary which he
has defined. The CSR system is also a valuable research and development tool
with analysis mode and recognition experiment mode features.

An airline query task was chosen to define the 100 word recognition vocabulary
for the LCSE data base. The phrases associated with this task are representative of
a simplified air travel information retrieval application. This syntax and vocabulary
were designed primarily with the goal of user flexibility in the task and not with the
goal of optimal recognition performance. The vocabulary includes three phonetically
similar groups of words: the digits, the teens ("ten" through "nineteen"), and
the decades ("twenty" through "ninety"). The vocabulary also includes three function
words, "of", "for", and "the", which are often unstressed in continuous speech.
Analog recordings were made for 25 males and 25 females, each speaking words and
phrases from the 100 word vocabulary.

Template training is a critical step for speaker dependent CSR systems and a
major accomplishment of the Limited Connected Speech Experiment was the
development of two effective training techniques, template extraction and template
averaging. ITTDCD's template extraction algorithm automatically locates and saves
the speech parameters of words embedded in continuous phrases. The template
averaging technique performs a clustering analysis on multiple tokens for the same
word and averages the speech parameters of similar tokens. Training tokens output
by the template extraction process are input to the template averager along
with tokens of individual words spoken in isolation. Each of the 50 data base
subjects spoke three repetitions of the 100 word vocabulary as well as 68 phrases which
could be used for template extraction.

The performance tests were conducted for fifty speakers, each speaking 50 random
phrases from the airline query grammar. The phrases contained 7.4 words on
average. The median word recognition accuracy was 94.5% for all words. Ignoring
errors on the words "of", "for", and "the", the median word rate was 96.8% and the
median phrase recognition rate was 64%. A phrase is considered correct if all words
are correctly identified. An extensive error analysis of the performance test results
was undertaken with all word errors being assigned to ten error classification types.
Function words aside, the major cause of word errors was found to be confusions
amongst the digits, decades, and teens. Typical examples are confusions
between "seven" and "seventy", "eight" and "eighty", and "sixty" and "sixteen". In
order to examine the impact of vocabulary selection on recognition performance
for a given task, another performance test was designed and conducted with an 82
word version of the airline query grammar. The decade and teen nodes were eliminated
from the syntax and all test phrases containing decade or teen words were
eliminated, reducing the average number of test utterances per speaker from 50 to
32. Excluding "of", "for", and "the" errors, a 98.0% word rate and a 90.5% phrase rate
were achieved on the 82 word vocabulary test.

On August 20, 1982, a demonstration of ITTDCD's real time laboratory CSR system
was presented to a representative of RADC. Six of the fifty performance test
subjects were asked to speak a series of phrases from the 100 word airline query
grammar. These phrases were recognized with an accuracy comparable to that
achieved on the performance test. In the course of the demonstration, various
features of the CSR system were exhibited including voice-control, training,
template extraction, template averaging, and analysis.

2. INTRODUCTION


The Limited Connected Speech Experiment had two primary goals. The first was to
demonstrate a Connected Speech Recognition (CSR) system which provides real
time response on a vocabulary of 100 words. The second was to test the recognition
performance of this CSR system over a data base consisting of 25 male and 25
female speakers. This chapter gives an overview of the tasks that were carried out
to achieve these goals.

2.1  Development  of  CSR  Control  Software 

Executive software was developed on the VAX computer to control the overall
operation of the CSR system including training, recognition, experiment, and
analysis. In addition, software was developed to provide for creation and maintenance
of speech parameter files for word templates and phrases. A description of
the operation of the CSR system appears in Chapter 3 along with a brief description
of its recognition algorithm.

2.2  Development  of  Syntax  Specification  Software 

An interactive syntax specification program was designed and implemented.
This software enables the operator to specify the vocabulary words and the finite
state grammar which define a CSR task. The resulting syntax file is employed to
guide training and recognition software on the CSR system. Further detail on syntax
specification is presented in Chapter 3.

2.3 Selection of a Syntax and Vocabulary

An airline query task was chosen to define the 100 word recognition vocabulary
for the LCSE data base. It was designed to be representative of limited syntax
application areas for speech recognition. The phrases are representative of a simplified
air travel information retrieval application. The entire 100 words and associated
finite state syntax are presented in detail in Chapter 4 of this report.

2.4 Generation of the Data Base

Analog recordings were made for 25 males and 25 females, each speaking
words and phrases from the 100 word airline query grammar. About half of the
speakers were chosen from within ITTDCD's San Diego laboratory and the remainder
were selected from agency referrals. None of the speakers had any prior experience
with speech recognition systems. Files of digital speech parameters for each
word and phrase were obtained by playing the analog tapes into a filterbank. The
steps taken to generate this data base are discussed in Chapter 4.

2.5  Investigation  of  Template  Averaging  Techniques 

After a review of the literature, a template averaging technique was implemented
and tested on an existing data base of connected digit phrases from five
speakers. To obtain a performance baseline for evaluating the technique, recognition
experiments were performed using single tokens as templates for each word.
CSR experiments were then conducted with averaged templates. A description of
the template averaging technique is contained in Chapter 5 along with results of the
template averaging study.

2.6 Extraction of Templates

Software was developed for automatically extracting templates from connected
speech utterances for the purpose of combining them with existing templates of the
same vocabulary word. The recognition system itself controls the template extraction
process, as described in Chapter 5.

2.7  Integration  and  Test  of  Software 

Training and recognition software were integrated on the VAX - AP120B system.
Ten of the data base subjects were designated as development speakers and a
series of recognition experiments was performed to establish the appropriate
training technique for the 50 speaker performance test. Chapter 6 describes the
experimental findings of the development testing process.

2.8  Performance  Test 

Templates were prepared for the 50 speaker data base using the template
extraction and template averaging software. Fifty phrases from the airline query
grammar were input to the CSR system for each of the 50 speakers. Performance
test results are presented in Chapter 7 and an analysis of word recognition errors is
presented in Chapter 8.

2.9 Demonstration

A demonstration of the CSR system was prepared and conducted for government
representatives. Six speakers from the performance test group participated
in the demonstration.

3. A REAL TIME LABORATORY CONNECTED SPEECH RECOGNITION SYSTEM


The Connected Speech Recognition (CSR) system was developed at ITTDCD's
San Diego laboratory to accomplish the goal of recognizing a syntax constrained,
100 word vocabulary in real time with a low error rate. This chapter gives an overview
of the operation and features of the system and a brief description of the CSR
recognition algorithm upon which the system is based.

3.1  Operational  Overview* 

Figure 3-1 shows a diagram of the limited connected speech exploratory
development system. The CSR system operates in an FPS AP-120B array processor
which is connected to a VAX 11/780 computer. The recognition algorithm is contained
entirely in the array processor with the VAX serving as the executive controller
which handles the user interface, long term storage of templates and grammars,
and support software such as the template averaging routines. An analog
filterbank is connected to the array processor via a DMA channel to permit real time
processing of the speech signal.

The user controls the system by keyboard input at a display terminal and by a
set of single word voice commands. At any time, the system is in one of three distinct
states, as illustrated in the state diagram of Figure 3-2. These states are
called the Command state, the Recognition state, and the Voice Control state. The
Command state is the normal, or default state of the system. In this state, 25
different commands can be entered from the keyboard or read from specified command
files. Some of these commands do the following:

• Read the syntax from a specified file,

• Set an environment variable (e.g., vocabulary size),

• Train the vocabulary, or a particular word,

• Turn the experiment mode on and collect statistics,

• Execute a specified command file (which contains commands like
these),

• Execute a VAX 11/780 operator command and return to the CSR system,

• Exit from the CSR system.

*A complete user's guide to the CSR system is included in Appendix A.

FIGURE 3-1

LIMITED CONNECTED SPEECH EXPLORATORY DEVELOPMENT MODEL

In  addition  to  these  commands,  executing  a  recognize  command  changes  the  state 
of  the  system  to  the  Recognition  state,  and  executing  the  control  command 
changes  it  to  the  Voice  Control  state. 

In the Recognition state, the system will recognize any syntactically legal
phrase specified by the grammar and the vocabulary of the current task. After the
phrase is spoken, the recognized text is displayed on the terminal and the system
either returns to the Command state, if the environment variable "Single_recog"
has been set on, or remains in the Recognition state ready for the next utterance, if
the variable has been set off. In the latter mode the user may return to the Command
state by hitting the interrupt key. At any point while speaking a phrase, the
user may cancel the phrase by immediately saying "Cancel". A transition to the
Voice Control state is accomplished by saying the word "Control".

In the Voice Control state a small subset of the 35 commands available to the
user in the Command state are activated by voice. When the user trains the system,
he is prompted to speak these control words, along with the task dependent
vocabulary words. The voice commands currently implemented are:

• "Display": Displays the current values of the environment variables.

• "Options": Displays the five best scoring phrase recognition options
for the last task phrase spoken in the Recognition state.

• "Word-scores": Displays the individual word scores for the last task
phrase spoken in the Recognition state.

• "Recognize": Changes the system state to the Recognition state.

• "Offline": Releases the array processor and returns to the Command
state.

As  in  the  Recognition  state,  the  user  can  also  switch  from  the  Voice  Control  state  to 
the  Command  state  by  hitting  the  interrupt  key. 
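
The three states and the transitions described in this section can be summarized as a small state machine. The Python sketch below is illustrative only: the names `State`, `TRANSITIONS`, and `next_state`, and the event labels, are assumptions made for this example and are not taken from the CSR software.

```python
from enum import Enum


class State(Enum):
    COMMAND = "Command"
    RECOGNITION = "Recognition"
    VOICE_CONTROL = "Voice Control"


# (current state, event) -> next state. Event labels are hypothetical names
# for the keyboard commands, spoken words, and interrupt key in the text.
TRANSITIONS = {
    (State.COMMAND, "recognize command"): State.RECOGNITION,
    (State.COMMAND, "control command"): State.VOICE_CONTROL,
    (State.RECOGNITION, "interrupt key"): State.COMMAND,
    (State.RECOGNITION, "spoken Control"): State.VOICE_CONTROL,
    (State.VOICE_CONTROL, "spoken Recognize"): State.RECOGNITION,
    (State.VOICE_CONTROL, "spoken Offline"): State.COMMAND,
    (State.VOICE_CONTROL, "interrupt key"): State.COMMAND,
}


def next_state(state, event, single_recog=False):
    """Return the state that follows `state` when `event` occurs."""
    # After a phrase is recognized, the Single_recog variable decides whether
    # the system drops back to the Command state or keeps listening.
    if state is State.RECOGNITION and event == "phrase recognized":
        return State.COMMAND if single_recog else State.RECOGNITION
    # Events that are not defined for the current state leave it unchanged.
    return TRANSITIONS.get((state, event), state)
```

For example, `next_state(State.RECOGNITION, "phrase recognized", single_recog=True)` returns `State.COMMAND`, matching the behavior described for "Single_recog" set on.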

3.2  The  ITTDCD  CSR  Algorithm 

Figure 3-3 gives an overview of the ITTDCD CSR algorithm. Three types of
inputs are supplied to the system, as shown on the left-hand side of the figure. The
input speech undergoes a parametric analysis performed by a Charge Transfer Device
(CTD) band pass filterbank. This filterbank was previously developed in conjunction
with an earlier RADC contract, the Solid State Audio/Speech Processor Analysis
(SSA/SPA) Contract (No. F30602-78-C-0359). Using eighteen 1/3 octave
switched-capacitor band pass filters and one full octave filter, the filterbank covers a
frequency range of 100 Hz to 9500 Hz and supplies 19 coefficients every 10 ms to a
parameter reduction algorithm.

The parameter reduction algorithm performs variable frame rate encoding to
remove redundant frames and converts the parameters to ten mel-cepstral
coefficients using a mel-cosine linear transformation. Details of this algorithm can
be found in sections 2.1.2.2 and 3.2.3 of the SSA/SPA final report.

The coefficients from the parametric analysis step are passed to the word
matching algorithm. This algorithm compares each word template with the spoken
utterance using a non-linear time alignment process carried out by the dynamic
programming match algorithm. The non-linear time alignment is necessary to
account for the natural time variations between different utterances of the same
word. The time warp constraints used in the algorithm force the length of the spoken
word to be between one-half and twice the length of its template.
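
As an illustration of non-linear time alignment under a two-to-one warp limit, the following Python sketch matches a template against an utterance by dynamic programming. It is a minimal sketch under stated assumptions, not the ITTDCD implementation: the function names, the Euclidean frame distance, and the particular local steps (1,1), (1,2), and (2,1) are choices made for this example. Those steps are one common way to keep the utterance length between roughly one-half and twice the template length.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two parameter frames (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_match(template, utterance):
    """Average aligned-frame distance, or infinity if no legal warp exists.

    Each local step advances (template, utterance) by (1,1), (1,2) or (2,1),
    so the overall slope of any path lies between one-half and two.
    """
    n, m = len(template), len(utterance)
    cost = [[math.inf] * m for _ in range(n)]   # accumulated distance
    count = [[0] * m for _ in range(n)]         # aligned frame pairs on path
    cost[0][0] = frame_distance(template[0], utterance[0])
    count[0][0] = 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, best_count = math.inf, 0
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, best_count = cost[pi][pj], count[pi][pj]
            if best < math.inf:
                cost[i][j] = best + frame_distance(template[i], utterance[j])
                count[i][j] = best_count + 1
    if cost[n - 1][m - 1] == math.inf:
        return math.inf
    return cost[n - 1][m - 1] / count[n - 1][m - 1]
```

In this sketch a four frame template can be matched to a seven frame utterance, but a nine frame utterance has no legal path and scores infinity. The path minimizes accumulated distance, and the average is taken afterwards for simplicity.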

A  second  level  dynamic  programming  algorithm  is  implemented  in  the  Word 
Sequence  Control  block  to  control  word  template  matching  and  to  concatenate 
matched  templates  into  the  connected  word  sequence  which  best  matches  the 
input  utterance.  Syntactic  constraints  define  the  set  of  word  sequences  that  can  be 
recognized  by  the  system  as  sentences. 

The ITTDCD CSR algorithm processes an unknown utterance from left to right
to find a sequence of words that closely matches it. Disregarding syntactic control
for now, the process takes place as follows. The dynamic programming algorithm
processes the unknown utterance one frame at a time. At every frame, matching
begins for all word templates. After a delay of one-half of the length of a word template,
each frame is a possible ending point for the template started. Thus, at each
frame, F, in which the matching of some word template, W, ends, a set of candidate
partial phrases (word sequences) is formed by appending the word W to the set of
partial phrases ending where W began. This is done for every word ending at frame
F, resulting in a large set of candidate partial phrases ending at the frame.
Because of memory and processing limitations, only the best N candidates are
retained, where N is determined for each frame based on the scores of the competing
candidates. This technique of varying the number of candidate phrases considered
at each frame is called a beam search [Lowerre and Reddy - 1980]. With the
beam search strategy, the system allocates more of its resources to nodes in the


Figure 3-4

An Example of a Finite State Grammar

node-to-node connection matrix which completely describe the syntax.

Figure 3-5 expands in detail the operation of the Word Sequence Control Module
appearing as a simple block in the algorithm overview of Figure 3-3. The control
module  keeps  track  of  the  best  candidate  phrases  ending  at  each  frame.  At  every 
frame  of  the  input  utterance,  the  competing  partial  phrases  are  stored  in  the 
phrase  description  tables  shown  near  the  bottom  of  the  figure.  Words  matching  the 
previous  portion  of  the  input  utterance  are  used  to  extend  partial  phrases  from  the 
table  to  obtain  a  new  set  of  partial  phrases.  The  new  set  is  limited  by  the  beam 
search  strategy  and  stored  in  the  phrase  description  tables.  The  grammar  node 
states  specified  by  the  candidate  phrases  thus  determine  which  nodes  and  words 
will  be  processed  in  the  next  frame.  The  new  grammar  node  states  are  expanded 
into  a  set  of  new  template  candidates  by  using  word-node  membership  information. 
These new template candidates will be used by the recognition algorithm to match
the next part of the input utterance.
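
The expansion of grammar node states into template candidates can be pictured with a small example. The Python sketch below assumes a hypothetical three node grammar; the node names, the words, the connection matrix values, and the function names are all invented for illustration and are not the airline query syntax.

```python
# Hypothetical grammar fragment: node names and words are illustrative only.
NODES = ["digit", "teen", "alpha"]

# Word-node membership: the vocabulary words assigned to each node.
WORDS_AT_NODE = {
    "digit": ["one", "two", "three"],
    "teen": ["thirteen", "fourteen"],
    "alpha": ["alpha", "bravo"],
}

# Node-to-node connection matrix: CONNECT[i][j] == 1 means that a word at
# NODES[i] may be followed by a word at NODES[j].
CONNECT = [
    [1, 0, 1],  # digit -> digit, alpha
    [0, 0, 1],  # teen  -> alpha
    [0, 0, 0],  # alpha -> (nothing in this fragment)
]


def next_nodes(active_nodes):
    """Nodes reachable in one step from the nodes of the surviving phrases."""
    reachable = set()
    for node in active_nodes:
        row = CONNECT[NODES.index(node)]
        reachable.update(NODES[j] for j, linked in enumerate(row) if linked)
    return reachable


def template_candidates(active_nodes):
    """Words whose templates should be matched against the next input."""
    words = set()
    for node in next_nodes(active_nodes):
        words.update(WORDS_AT_NODE[node])
    return sorted(words)
```

With these assumed tables, phrases ending at the "teen" node lead only to the templates for "alpha" and "bravo", so the remaining templates need not be matched at that point.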

When the end of the utterance is detected, the next grammar node states are
checked to determine which of them are connected to a final state. The best scoring
candidate phrase leading to a final state is reported as the recognized sentence.
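
The beam search pruning described earlier in this section, in which the number of retained candidates depends on the competing scores, might be sketched as follows. This report does not state the rule used to determine N, so the `beam_width` and `max_candidates` parameters and the function name are assumptions made for this example. Scores are treated as distances, so lower is better.

```python
def prune_to_beam(candidates, beam_width, max_candidates):
    """Keep the partial phrases whose score is close to the best score.

    candidates: list of (score, phrase) pairs, lower score = better match.
    beam_width: hypothetical score margin; candidates worse than the best
        score by more than this margin are dropped.
    max_candidates: hypothetical hard cap standing in for memory limits.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[0])
    best_score = ranked[0][0]
    kept = [c for c in ranked if c[0] <= best_score + beam_width]
    return kept[:max_candidates]
```

When one candidate scores far better than the rest, few candidates survive; when many candidates score about the same, more of them are carried forward to the next frame.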

The templates used to match the input utterance are generated from training
speech by the template generation software. The techniques used in this part of
the system are discussed in Chapter 5.

4. THE LIMITED CONNECTED SPEECH DATA BASE


The development of a connected speech recognition system requires extensive
testing on large data bases from a wide sampling of the speaker population to
obtain statistically significant performance data. For the LCSE work, ITTDCD
recorded three repetitions of a 100 word vocabulary and 166 sentences generated
from the vocabulary for each of 25 male and 25 female speakers. Generation of this
data base required seven carefully performed tasks. They are:

1.  Data  base  design, 

2.  Design  and  acquisition  of  recording  facilities, 

3.  Design  and  implementation  of  data  base  collection  software, 

4. Selection of a speaker population,

5.  Data  base  collection, 

6.  Data  base  pruning, 

7.  Data  base  processing. 

These  seven  tasks  will  be  discussed  in  this  chapter  and  the  related  Appendix  B. 

4.1 LCSE Data Base Design

In response to the contract requirements, a limited syntax 100 word vocabulary
data base was designed to be recorded by 25 females and 25 males. Figure 4-1a
shows the finite state syntax node structure and Figure 4-1b gives the vocabulary
words assigned to each node. The syntax and vocabulary were chosen to be
representative of a simplified air travel information retrieval application which we
call the airline query task. However, the syntax and vocabulary were designed
primarily with the goal of user flexibility in the task and not with the goal of optimal
recognition performance. Thus, the vocabulary includes the digits, the teens ("ten"
through "nineteen"), and the decades ("twenty" through "ninety"), which can be
spoken in various combinations within an airline query phrase. The similarity of
digit, teen, and decade words can pose a challenging recognition task since the
syntax allows any of these words to appear in the same place in a phrase.

Figure 4-1a

Syntax Node Structure 100 Word Airline Query

The 100 word vocabulary also includes 26 alpha words ("alpha", "bravo",
"charlie", etc.), which comprise one node. This node both precedes and follows the
digit-teen-decade sequence in the syntax and provides a test of the system's ability to
match many templates against the input utterance in real time.

The LCSE connected data base was designed to be a subpart of a larger data base
collection effort. The larger data base included a 200 and 300 word limited syntax
airline query grammar, a connected digit component, an alphabet spelling component,
and a diagnostic rhyme component. The content of this data base, the
recording facilities, and the procedures used to collect the data have been described
in a paper* given at the Workshop on Standardization for Speech I/O Technology on
March 18, 1982. This paper covers the first five topics listed above for the LCSE
connected speech data base and is included in Appendix B. The last two topics, data
base pruning and data base processing, are discussed in the remainder of this
chapter.

*Londell, B. P., Smith, A. R., Koble, H. M., and Alcove, U. L., "A Continuous Speech Data Base,"
presented at the Workshop on Standardization for Speech I/O Technology, National Bureau of
Standards, Gaithersburg, Md., March 16-19, 1982.

4.2 Data Base Pruning

A total of 63 speakers (31 males and 32 females) were recorded to permit
selection of speech data for 25 males and 25 females which is free from recording
and processing problems. After recording, the data base was pruned from 63
speakers to 50 speakers. This pruning was done before any recognition performance
testing and was based on either random elimination, or on the presence of
difficulties in the recording procedures. The following list gives the reasons for
speaker elimination due to recording difficulties:

• For the first speaker (male) the recording procedures had not been completely
tested so that the recording session was very long and fragmented,

•  One  speaker  (female)  did  not  complete  the  recording  session, 

•  Excessive  tampering  with  the  close  talking  head  mounted  microphone 
during  the  session  (2  males), 

•  Tape  recorder  was  set  with  the  variable  pitch  control  activated  (2  males), 

• Part of analog tape was recorded over (1 male),

•  Excessive  environment  noise  outside  of  recording  room  (3  females). 

Although some of these problems may in fact be conditions which a speech
system might encounter (e.g., environmental noise and microphone movement),
they were not conditions that we wanted to study in this contract. After pruning
speakers with recording difficulties, the resultant data base contained 25 males
and 28 females. The remaining three females were eliminated by random selection.

4.3  Data  Base  Processing 

The above data base generation steps resulted in a set of analog tapes containing
training and test data for 50 speakers. Although these tapes could have
been used directly in the system to train it for each speaker and then to test it,
such a procedure would require many hours of tape handling as various tests were
run. These problems were circumvented by processing the data once through the
filterbank to create a series of digital files containing word training tokens and test
phrases. These files were then used over and over again by computer programs to
train and test the system as it was developed.

The data base processing task is a two step process. In the first step, the
operator plays about five minutes of speech (determined by available disk space)
from an audio tape through the filterbank/array processor front end to generate
an output file of filter parameters. In the second step, a VAX 11/780 program
processes the filter parameter file using the recording session history file*. The
VAX program uses marking tones in the filter parameter file to synchronize the
frame position with time marks in the history file. The VAX program also performs
endpoint detection to find each utterance in the output file and splits the input
filter parameter file into smaller files which are tagged with an ASCII label
describing the utterance. Occasional endpoint detection problems occurred when a
speaker corrected himself in the midst of a word or phrase without pausing before
repeating the text. In these cases (less than 1% of all utterances), the operator
listened to the speech and used an amplitude plot to determine the proper window
within which the endpoint detection algorithm could be safely rerun.
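
This report does not describe the endpoint detection algorithm itself. As a rough illustration of the idea, and of why a self-correction without a pause is troublesome, the Python sketch below splits a sequence of frame energies into utterances using a simple threshold. The function name, the `threshold` and `min_silence_frames` parameters, and the energy-threshold method are all assumptions made for this example.

```python
def find_utterances(energy, threshold, min_silence_frames):
    """Return (start, end) frame indices of utterances, end exclusive.

    A frame is treated as speech when its energy exceeds `threshold`. An
    utterance ends once `min_silence_frames` consecutive non-speech frames
    have been seen, so shorter pauses stay inside the same utterance.
    """
    segments = []
    start = None   # first frame of the utterance in progress, if any
    silence = 0    # consecutive non-speech frames since the last speech frame
    for i, e in enumerate(energy):
        if e > threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Input ended during an utterance; trim any trailing silence.
        segments.append((start, len(energy) - silence))
    return segments
```

Under this scheme, a speaker who restarts a phrase after a pause shorter than `min_silence_frames` produces one long segment rather than two, which is the kind of case that called for operator intervention.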


*As explained in Appendix B, the recording history file contains the exact text with which the
speaker was prompted together with a time mark. The time mark is computed relative to a
tone recorded on the tape at the beginning of the session.

5. TRAINING TECHNIQUES


Template training is a critical step for speaker dependent CSR systems, since
the performance of the system is limited by the degree to which the templates
model the test speech. This chapter describes two techniques that were used to
obtain improved word template models: template averaging and the
extraction of templates from connected speech.

5.1 Template Averaging Study

A template averaging study was performed to evaluate the effectiveness of
template clustering and averaging techniques in connected speech recognition.
After a review of the literature, we decided to employ the Unsupervised Clustering
Without Averaging (UWA) algorithm as described by Rabiner and Wilpon*. Since
complete details of the technique are available in their paper, we will only give an
overview of the technique before discussing our results.

The technique was implemented on the PDP 11/60 to cluster and average multiple
tokens or samples for a given word. Inputs to this software include the file
names of individual tokens, a clustering distance threshold, and the number of
desired output templates. The software performs three steps to obtain the averaged
templates: first, it computes the similarity distance and the frame-to-frame
correspondence between each pair of tokens; second, it applies a clustering
algorithm to the tokens; and finally, it averages the speech parameters across the
tokens of each cluster on a frame-by-frame basis according to the previously
computed frame-to-frame correspondence.

*Rabiner, L., and Wilpon, J., "Considerations in Applying Clustering Techniques to Speaker
Independent Word Recognition," Journal of the Acoustical Society of America, 66 (3),
September 1979.

In the first step a dynamic programming algorithm (DPA) is used to compute
the similarity distance and the frame-to-frame correspondence between pairs of
input tokens. The DPA finds the best non-linear correspondence between a pair of
tokens. The average distance between the frames that are aligned by the process
gives the overall similarity distance between the tokens. Constraints are imposed
on this alignment process so that neither token can be "stretched" to more than
twice its length to match the other token. Thus, in some cases tokens do not
match and are given a large similarity distance. In the clustering step which
follows, these tokens will be prevented from appearing in the same cluster. The
algorithm thus prevents such tokens from being averaged.

The similarity distances between all token pairs give an intertoken distance
matrix. The cluster step of the algorithm uses the matrix to compute the
minimax center of the token set. The minimax center is simply that token for
which the maximum distance to any other token is minimized. Then, in an
iterative process, any token whose distance to the center exceeds the
clustering threshold (supplied to the algorithm) is removed. A new minimax
center is then computed on the reduced set. The process iterates until the
center does not change. All tokens in the final set are within the cluster
threshold of the center and form the cluster, and the final center token
becomes the cluster center. Tokens that have been removed are reprocessed to
find a second cluster. The process continues until no tokens remain. A
variable number of clusters is computed and each token is finally assigned
to a cluster (an outlying token might form a cluster of one).
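
The clustering loop described above can be sketched as follows. This is an
illustrative reading of the procedure, not the original software; the
function names and the tie-breaking rule (the lowest-numbered token wins when
several tokens are equally good centers) are assumptions.

```python
def minimax_center(dist, members):
    """Token whose largest distance to any other member is smallest."""
    return min(
        members,
        key=lambda c: max((dist[c][t] for t in members if t != c), default=0.0),
    )


def cluster_tokens(dist, threshold):
    """Split tokens into clusters; returns a list of (center, members).

    dist is a square intertoken distance matrix (list of lists) and
    threshold is the clustering distance threshold.
    """
    remaining = list(range(len(dist)))
    clusters = []
    while remaining:
        members = list(remaining)
        center = minimax_center(dist, members)
        while True:
            # Drop every token that lies farther than the threshold
            # from the current center, then recompute the center.
            members = [t for t in members if dist[center][t] <= threshold]
            new_center = minimax_center(dist, members)
            if new_center == center:
                break
            center = new_center
        clusters.append((center, members))
        # Removed tokens are reprocessed to form the next cluster.
        remaining = [t for t in remaining if t not in members]
    return clusters
```

For example, with three mutually close tokens and one outlier,
cluster_tokens returns one cluster of three and a second cluster containing
only the outlier.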

The  final  averaging  step  averages  all  tokens  within  each  cluster.  The  frame- 
to-frame  correspondence  of  each  token  with  the  center  token  determines  which 
frames  are  averaged  together  to  form  the  final  average  template. 
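
A minimal sketch of the averaging step is given below. It assumes that each
non-center token comes with its alignment path to the center token, expressed
as (center frame index, token frame index) pairs, and that every aligned
frame carries equal weight; the report does not state the weighting, so this
is an assumption, as is the name average_cluster.

```python
def average_cluster(center, others, paths):
    """Average the tokens of one cluster onto the center token's time axis.

    center : list of frames (each a list of floats)
    others : the remaining tokens of the cluster
    paths  : for each token in others, its list of
             (center_index, token_index) aligned frame pairs
    """
    dim = len(center[0])
    sums = [list(frame) for frame in center]  # start from the center token
    counts = [1] * len(center)
    for token, path in zip(others, paths):
        for ci, ti in path:
            for k in range(dim):
                sums[ci][k] += token[ti][k]
            counts[ci] += 1
    return [[v / counts[i] for v in sums[i]] for i in range(len(center))]
```

The result has the same number of frames as the center token, which is why
the center token's correspondence, rather than any other token's, determines
which frames are averaged together.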

5.1.1 Template Averaging Experiments with Filterbank Parameters

Averaged templates were evaluated in several speech recognition experiments.
A connected digit data base of five speakers (three male and two female) was
used in these experiments. The training data base consisted of six tokens
per word per speaker for each of the ten digits. The test data base
contained 150 three, four, and five digit phrases per speaker. To provide a
baseline for these experiments, each of the six token sets for each speaker
was used as a template set. The first line in Table 5-1 is the average
recognition rate in the baseline experiments. The second line shows the
performance using the center of all tokens for each word. That is, the
cluster threshold was set at a maximum value so that only one cluster was
found. For the third experiment, the threshold was dropped so that more than
one cluster may have been formed per word. The center token of the largest
cluster was used to represent the word. For the experiment shown on the
fourth line, the tokens of the largest cluster were averaged. This
experiment showed that a higher phrase recognition rate (76%) could be
achieved by averaging parameters than by techniques of selecting individual
tokens to represent the words.


5.1.2  Experiments  Averaging  Speech  Parameters. 

As described in Chapter 3, the CSR system uses mel-cepstral coefficients for
its speech parameters. These are obtained from the filterbank parameters by
a linear transformation. During the template averaging study, another type
of linear transformation, which was then under investigation, was employed.
This transformation was obtained by performing a linear discriminant
analysis on marked and labeled speech segments. Although the linear
discriminant transformation technique was abandoned (it was too speaker
dependent for the Limited CSR system), the results of the averaging study
using these parameters are presented here because the parameters are similar
to the mel-cepstral coefficients used in the system. Table 5-2 presents
results of the experiments. The right hand column labelled "Not Avgd"
represents the results of running each token set of linear discriminant
parameters separately and averaging the results. This column represents a
benchmark for evaluating the effectiveness of the averaging process. The
columns headed by the clustering thresholds present the template averaging
results. In each case, the tokens in the largest cluster were averaged,
producing one averaged output template per word, and in each case improved
recognition results were obtained in comparison to the baseline figures. An
additional experiment was performed in which templates made from averaged
filter bank parameters (line 4 of Table 5-1) were transformed prior to
recognition. This experiment yielded a phrase rate of 90% and a word rate of
[illegible], indicating that recognition results from averaging transformed
parameters are comparable to those from averaging filter bank parameters and
then applying the linear transformation.

Table 5-2

Experiments Averaging Linear Discriminant Parameters

5.1.3 Conclusions

The preliminary experiments indicated that simply using cluster centers as
templates did not significantly improve recognition. However, averaging
proved to be effective. Averaging of filter bank parameters cut phrase
recognition errors by 30% (performance went from 67% to 76%), and averaging
the speech parameters was even more effective in that the phrase errors were
cut in half (performance went from 84% to 92%).

An unexpected result was the phrase rate of 92% achieved with a maximum
clustering threshold. The maximum clustering threshold forced the software
to average all six of the input tokens for each word, with the exception of
those infrequent cases where a test token was over twice or less than half
the length of the center token. In this experiment, for many of the words,
quite dissimilar tokens were averaged together, yet recognition performance
did not suffer. This result, as well as the general insensitivity of the
process to the clustering threshold, is probably due to the limited number
of tokens per word.

5.2 Template Extraction from Continuous Speech

A word template produced from an isolated pronunciation of a word is often
an inadequate representation of how that word appears in continuous speech.
However, isolated word training has the advantage of letting a new speaker
quickly and easily train the system. Therefore, our approach allows a
speaker to initially train the system by reciting the word vocabulary (in
response to system prompting) and then train the system using sentences from
the task grammar. Thus, both isolated and continuous versions of a word can
be obtained to generate more robust templates.

Associated with a given syntax is a set of standard phrases which are
constructed both to satisfy the node sequence of the grammar and to contain
one or more occurrences of each word in the vocabulary. During training, the
user is prompted to speak each of the phrases in the standard phrase set.
The user may also construct and say phrases of his own choosing during the
training or retraining phase. When the extract command is executed, the
system performs what is called "forced recognition" for each phrase which
has been spoken.

"Forced  recognition"  limits  the  CSR  system  so  that  it  can  only  recognize  the 
sequence of words spoken in the phrase. This is easily obtained by
automatically devising a syntax that allows only one sequence of words,
i.e., the words of the phrase that has been spoken. The algorithm requires
that an existing template be available for each word in the known phrase.
Forced recognition is done in the digital input mode, that is, the speech
parameters input to the CSR system are read from a phrase file which was
created during training. Following the forced recognition, the word
endpoints found by the DPA matching module of the recognition algorithm are
used to extract word patterns from the parametric representation of the
phrase, and these patterns are output as word templates. Before extracting
the parameters for a word, the system checks the word scores of the bounding
words in the phrase to ensure that they are below a threshold. This boundary
test ensures proper alignment for the extracted word. If the test fails, the
word is rejected and not used in the template averaging process.
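
The extraction and boundary test can be sketched as follows. The data layout
is hypothetical: each segment is assumed to be a (word, start frame, end
frame, word score) tuple as produced by forced recognition, with lower scores
meaning better matches, and the end frame is treated as exclusive. The name
extract_tokens is also an assumption.

```python
def extract_tokens(frames, segments, threshold):
    """Cut word tokens out of a phrase using forced-recognition endpoints.

    frames    : parametric representation of the phrase (list of frames)
    segments  : list of (word, start, end, score) in spoken order
    threshold : maximum acceptable score for the bounding words
    """
    tokens = []
    for i, (word, start, end, _score) in enumerate(segments):
        # Scores of the words immediately before and after this word.
        neighbours = []
        if i > 0:
            neighbours.append(segments[i - 1][3])
        if i + 1 < len(segments):
            neighbours.append(segments[i + 1][3])
        # Extract only if every bounding word matched well enough.
        if all(score < threshold for score in neighbours):
            tokens.append((word, frames[start:end]))
    return tokens
```

A word at the start or end of a phrase has only one bounding word in this
sketch; how the original system treated phrase-initial and phrase-final words
is not stated in the text.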

6. DEVELOPMENT TESTING


The  goal  of  development  testing  was  to  identify  areas  of  recognition  algorithm 
improvement  by  exercising  the  CSR  system  and  to  determine  an  adequate  training 
technique for the performance test. The intention was that at the conclusion
of development testing, the CSR algorithm and the training technique would
be established.

6.1  Development  Data  Base 

Ten of the 50 data base subjects were selected as development testing
speakers, five males and five females. Each of these speakers had recorded
100 phrases from the 100 word airline query grammar. These 100 phrases were
divided into two sets, one to be used as development test material, and a
second set to be used in the final performance test. For training material,
each speaker had recorded three repetitions of the 100 word vocabulary
(these are referred to as "isolated" tokens) and 66 standard phrases
available for template extraction. Development experiments were structured
so that the 50 development set phrases were run versus one set of templates
for each of the ten speakers. However, for six of the speakers one phrase
was eliminated because it was syntactically incorrect. Each development
experiment thus consisted of 494 phrase trials of average length 7.3 words.

6.2  Function  Words. 

Six words of the 100 word airline vocabulary are referred to as function
words. They are the article "the" and the prepositions "of", "for", "to",
"at", and "from". These words have a special role in the vocabulary and were
the target of specific training techniques. The function words are often
discussed as a group in this and the remaining chapters of this report.

From previous experience in continuous speech recognition, we realized that
the function words were deserving of special attention. These words are
often unstressed and sometimes dropped completely from spoken phrases.
Function words are also often significantly colored by coarticulation.
Isolated renditions of these words are of little value as templates because
their duration is often two to three times longer than function word
duration in continuous speech. Thus template extraction seemed to be clearly
in order for the function words.

The usage of three of the function words in the 100 word airline grammar is
also worthy of discussion. The word "the" is always an optional one word
node and may or may not be included in a given phrase. The words "of" and
"for" constitute a two word node which appears in two separate paths of the
syntax. These three words never affect the meaning of an airline query
phrase. For example, there is no semantic difference between the phrases
"Report the current weather of San Diego" and "Report current weather for
San Diego". We address this subject here because, in following chapters, we
frequently present the word recognition accuracy for all words along with
the word recognition accuracy excluding "the", "for", and "of".

6.3 Preliminary Experiments

For the first development experiment, template sets were made from the first
vocabulary repetition for each non-function word and from an extracted
template for each of the function words. For the second preliminary
experiment, the three vocabulary repetitions were averaged to produce an
"averaged" template set for each speaker. These templates were then employed
in template extraction of three tokens for each of the function words, which
were then averaged and substituted for their isolated counterparts in the
averaged template sets. The recognition results for preliminary experiments
are presented in Table 6-1.


Table 6-1

Results of Preliminary Experiments
For Ten Development Speakers.

494 Phrases from 100 Word Airline Vocabulary

                                                    Excluding
                            All Words            "the, for, of"
                         Word   Word  Phrase    Word   Word  Phrase
Template Set            Trials  Rate   Rate    Trials  Rate   Rate

Single Tokens            3633   79.2   24.8     2864   85.7   53.8
3 Avg. Tokens            3633   84.9   39.8     2864   88.7   64.2
3 Avg. Tokens/Silence    3633   86.4   41.4     2864   90.6   67.0

Three types of word errors may occur in the recognition of a phrase:
substitution of a wrong word for a spoken word, deletion of a spoken word,
and insertion of a word which was not spoken. The first two errors cause a
decrease in the count of correct words recognized. The insertion error is
noted by increasing the count of total word trials. Thus, the word
recognition rate is computed according to the following formula:

Word  Rate  =  (100  x  Correct-words)  /  (total-words  +  insertions) 
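
As a worked check of the formula, the short sketch below applies it to the
counts reported in Table 7-1 for the performance test (17284 correct words,
18416 word trials, and 144 insertions for all words; 14307, 14873, and 77
when "of", "for", and "the" errors are excluded). The function name is ours,
and treating the tabulated word trials as the total-words term is an
assumption, although it reproduces the word rates given in that table.

```python
def word_rate(correct_words, total_words, insertions):
    """Word recognition rate in percent, with insertions penalized
    by enlarging the denominator."""
    return 100.0 * correct_words / (total_words + insertions)


# Counts taken from Table 7-1 (performance test, 50 speakers).
all_words = word_rate(17284, 18416, 144)
excluding = word_rate(14307, 14873, 77)
print(round(all_words, 1), round(excluding, 1))  # prints 93.1 95.7
```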

In the first two experiments, it was noted that recognition errors were
often introduced when the speaker paused in the midst of a long phrase. To
correct this problem, a silence template was included in template storage
for each speaker. The recognition algorithm automatically tests the silence
template for possible insertion between every pair of words in the phrase
and at the end of the phrase. As noted in Table 6-1, the silence template
improved overall word recognition accuracy by 1.5 percent. We decided to
include the silence template in all subsequent experiments.


The experiment with three averaged templates reduced the word error rate by
about one-fourth of the single token rate. Excluding "the", "for", and "of"
errors, the word recognition rate was 90.6% versus 86.4% for all words.
Thus, thirty percent of all word errors were the insertion or deletion of
"the", or the confusion of "for" and "of".


6.4 Comparison of Training Techniques

The next series of development experiments was designed to evaluate the
performance of extracted versus isolated templates in the recognition
process. A set of 66 phrases had been recorded by each speaker for the
purpose of template extraction. This set includes at least two occurrences
of each vocabulary word. As described in Chapter 5, tokens were
automatically extracted for each speaker with the recognition algorithm
recognizing the phrases in a forced recognition mode, using the three
averaged tokens from the final preliminary experiment as templates.

Recognition accuracy was then compared over four template sets. The first
template set was comprised of the three isolated averaged tokens used in the
preliminary experiment. A second template set was created by averaging two
extracted tokens for each word. The third experiment employed the union of
the first and second template sets, i.e., two templates for each word. A
fourth template set was made by averaging the two extracted tokens and the
three vocabulary repetition tokens for each word. Results of these
experiments are presented in Table 6-2.

Table 6-2

Results of Recognition Experiments
For Ten Development Speakers.

494 Phrases from 100 Word Airline Vocabulary

                                                         Excluding
                                        All Words      "the, for, of"
      Template Set                     Word   Phrase   Word   Phrase

 #1   Avg Three Isolated Tokens        86.4    41.4    90.6    67.0
 #2   Avg Two Extracted Tokens         87.7    46.6    90.7    66.6
 #3   Two Templates/Word: #1 and #2    91.?    58.6    95.8    81.8
 #4   Avg Five Tokens                  90.0    49.8    93.8    75.2

The results indicate little performance difference between averaged-isolated
and averaged-extracted templates. Using two templates per word significantly
improved performance, cutting word recognition errors almost in half when
the three function word errors are excluded. However, using two templates
per word doubles both template storage and processing requirements and is
therefore infeasible in light of the real time response goal for the LCSE
CSR system. Word accuracy with template set #4, the average of five tokens
per word, was significantly better than with either template set #1 or #2
and, since set #4 uses only one template per word, it appears to be the most
realistic training approach. A comparison of the word rate with template set
#4 and with the single token template set of the first preliminary
experiment (Table 6-1) indicates that averaging five tokens cuts the overall
word errors in half (accuracy changes from 79.2% to 90.0%). This figure is
entirely consistent with the word error reduction on the digit phrase task
in the template averaging study presented in Chapter 5.

In Table 6-3, we present the word error rate with template set #4 for six
subsets of the vocabulary. The table clearly shows two sources of
recognition errors: the function words and the digit-decade-teen group. The
word rate on 1665 occurrences of the other 66 words in the vocabulary was
97.4%. Nearly all of the 66 word group are multi-syllable words, including
airline and city names and the 26 word alpha set. Averaging of tokens taken
both from isolation and from continuous speech appears to be quite effective
for the multi-syllable word group.

Table  6-3 

Categories  of  Word  Errors 
With  Averaged  Templates  From  Five  Tokens 
For  Ten  Development  Speakers. 


6.5  Modification  of  Template  Extraction  Algorithm 

In an effort to further improve performance, we examined the template
extraction algorithm and found that the process was occasionally producing
extracted tokens with faulty endpoints. To overcome this problem we added
word score testing to the algorithm. If the normalized DPA scores of the
words preceding and following the word to be extracted were both below a
threshold, an extracted token was output. If the threshold test failed for
either bounding word, extraction was not performed. The word score test was
intended to ensure that the speech frames to be extracted had been properly
aligned by the dynamic programming algorithm.

A final development testing experiment was performed using the modified
extraction algorithm. In this experiment, the number of extracted tokens per
word was not limited to two. While the standard set of 66 phrases contains
each vocabulary word at least twice, commonly occurring words appear
multiple times. In some cases, as many as nine tokens were extracted for a
word. On the other hand, the word score test caused rejection of some
extracted tokens, and in some cases there were no extracted tokens available
for the average template for a word. Averaged templates were generated by
averaging the three vocabulary repetitions and all available extracted
tokens. Results of the final development test experiment are presented in
Table 6-4.

Table 6-4

Results of Final Development Test
For Ten Development Speakers.

494 Phrases from 100 Word Airline Vocabulary

                                      Excluding
                  All Words         "the, for, of"
Speaker         Word    Phrase      Word    Phrase
Number          Rate     Rate       Rate     Rate

  03           89.6%    49.0%      93.8%    69.4%
  06           89.7%    49.0%      94.6%    77.6%
  11           94.3%    61.2%      98.9%    93.9%
  16           85.5%    32.7%      93.4%    73.5%
  21           94.1%    67.3%      97.4%    89.8%
  23           94.5%    65.3%      96.9%    83.7%
  31           96.6%    76.0%      99.6%    98.0%
  36           96.3%    76.0%      99.0%    96.0%
  41           92.1%    62.0%      95.7%    82.0%
  43           92.2%    54.0%      97.3%    88.0%

Average         92.5     59.3       96.7     85.2

Comparison of the recognition rates in Table 6-4 with those in Table 6-2 for
template set #4 shows that significant improvement was obtained by the
modified template extraction algorithm. Overall word rate improved from
90.0% to 92.5%, while the phrase rate improved from 49.8% to 59.3%. Word
rates for individual speakers ranged from 85.5% to 96.6%.

6.6 Training Technique for Performance Tests


After  evaluation  of  the  various  training  approaches  for  the  development 
speakers,  we  established  the  training  procedure  for  the  50  speaker  performance 
test.  The  final  approach  is  a  five-step  automatic  process  as  follows: 

1. Average the three vocabulary repetitions for each word in the 100 word
vocabulary.

2. Using the templates created in Step 1, extract multiple tokens for the
six function words ("of, for, the, to, at, from") from the standard phrase
set for each speaker.

3. Average the extracted tokens for each of the function words.

4. Using the function word templates from Step 3 and the remaining templates
from Step 1, extract multiple tokens for all words from the standard phrase
set.

5. For each word, average all available tokens, thus generating the template
to be used in the performance test. For non-function words, the available
templates include the three vocabulary repetitions and all tokens extracted
in Step 4. For function words, the vocabulary repetitions are excluded from
input to the final average template.
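
The five steps above can be sketched as a single routine. Everything in the
sketch other than the step order is an assumption: average and extract are
hypothetical callables standing in for the template averaging software and
the forced-recognition extraction, and the fallback for a function word with
no extracted tokens is our own choice, since the report does not say how that
case was handled.

```python
FUNCTION_WORDS = ("of", "for", "the", "to", "at", "from")


def train_speaker(isolated, phrases, average, extract):
    """Build one averaged template per word for a single speaker.

    isolated : dict mapping word -> list of isolated tokens (three each)
    phrases  : the speaker's recorded standard phrase set
    average  : callable(list of tokens) -> template   (hypothetical)
    extract  : callable(templates, phrases, words) -> dict mapping
               word -> list of extracted tokens        (hypothetical)
    """
    # Step 1: average the vocabulary repetitions of every word.
    templates = {word: average(tokens) for word, tokens in isolated.items()}
    # Step 2: extract tokens for the function words only.
    fw_tokens = extract(templates, phrases, FUNCTION_WORDS)
    # Step 3: replace the function word templates by extracted averages.
    for word in FUNCTION_WORDS:
        if fw_tokens.get(word):
            templates[word] = average(fw_tokens[word])
    # Step 4: extract tokens for all words with the improved template set.
    all_tokens = extract(templates, phrases, tuple(isolated))
    # Step 5: final average; isolated tokens are left out for function words.
    final = {}
    for word, tokens in isolated.items():
        extracted = list(all_tokens.get(word, []))
        pool = extracted if word in FUNCTION_WORDS else list(tokens) + extracted
        # Assumed fallback: keep the current template if nothing is available.
        final[word] = average(pool) if pool else templates[word]
    return final
```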

7. PERFORMANCE TEST


The performance tests were carried out on a data base of 50 random phrases
from the airline query grammar spoken by each of the 50 data base subjects.
Each of the subjects also spoke a training set consisting of three
repetitions of the 100 word airline vocabulary and 66 phrases used for
template extraction. A single averaged template for each vocabulary word was
generated for each speaker according to the procedure described in Section
6.6.

7.1 Performance Test Results

A summary of performance test results is presented in Table 7-1. The average
word recognition rate for all words was 93.1%, while excluding "of", "for",
and "the" errors, the average word rate was 95.7%. The average phrase
recognition rate was 64.6%, and the corresponding rate was 83.0% when "of",
"for", and "the" errors are ignored. A phrase is considered correct if all
words in the phrase are correctly identified. Note that of the 884 phrases
which were recognized incorrectly, in 459 cases the only error was the
confusion of "of" and "for" or the deletion or insertion of "the".

On each recognition trial, the CSR system reports five candidate recognition
results ranked by score. In Table 7-1 the rows labelled "OPTION" show the
number of times each candidate was the correct result. The line labelled
"OPTION #2" indicates that in 310 cases (12.4%), the CSR system's second
choice was the correct phrase.

Table 7-1

Summary of Performance Test Results
100 Word Airline Grammar

50 Speakers - 50 Phrases per Speaker

                                                   EXCLUDING
                             ALL WORDS          "of, for, the"

PHRASE TRIALS = 2500

CORRECT                    1616    64.6%        2075    83.0%
OPTION #2                   310    77.0%         149    89.0%
OPTION #3                    75    80.0%          35    90.4%
OPTION #4                    38    81.5%          12    90.9%
OPTION #5                    24    82.5%          12    91.3%
MEDIAN PHRASE RATE                 68.0%                84.0%

WORD TRIALS               18416                14873
CORRECT                   17284                14307
INSERTIONS                  144                   77
DELETIONS                   358                   61
WORD RATE                 93.1%                95.7%
MEDIAN WORD RATE          94.5%                96.7%
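
The percentages in the OPTION rows of Table 7-1 appear to be cumulative, that
is, the fraction of phrases for which the correct result was among the first
k candidates. The sketch below recomputes them from the tabulated counts
under that assumption; the function name is ours. The recomputed values agree
with the tabulated ones to within about 0.1 percentage point.

```python
def cumulative_rates(counts, trials):
    """Percentage of trials whose correct result is within the top k
    candidates, for k = 1, 2, ... (counts[k-1] = times candidate k
    was the correct result)."""
    rates, total = [], 0
    for count in counts:
        total += count
        rates.append(100.0 * total / trials)
    return rates


# Counts from Table 7-1: first choice correct, then options #2 to #5.
all_words = cumulative_rates([1616, 310, 75, 38, 24], 2500)
excluding = cumulative_rates([2075, 149, 35, 12, 12], 2500)
for k, (a, e) in enumerate(zip(all_words, excluding), start=1):
    print(f"top {k}: {a:.2f}%  {e:.2f}%")
```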

A complete tabulation of phrase and word recognition results by individual
speaker is included in Appendix C. Figure 7-1 presents a histogram of word
rates and phrase rates for all 50 speakers. The median of the distribution
of overall word rates is 94.5%. The median figure is 1.3% higher than the
overall average word rate for all speakers. The median provides a better
estimate of the expected performance of an unknown speaker because, as
Figure 7-1 shows, the average word rate is lowered significantly by a small
group of poorly performing speakers. Excluding "of", "for", and "the"
errors, the median rate is 96.7%, one percent higher than the corresponding
average word rate.

In Table 7-2, we present a summary of word recognition rates according to
the sex, age, and educational background of the speakers. The overall
average rate of female speakers exceeded that of males by 0.8%. The age
summary suggests that older speakers perform better than younger ones, while
educational background seems to have no correlation with word rate
performance.

Table 7-2

Summary of Word Recognition Rates by Speaker's
Sex, Age and Educational Background.

SEX
                       Male     Female
Overall Word Rate      92.7%    93.5%

AGE
                       Teens-Twenties   Thirties   Forties-Fifties
Number of Speakers           28            15             7
Overall Word Rate          92.5%         93.4%          94.8%

EDUCATION
                       High     Junior    Bachelor   MS
                       School   College   Degree     Degree   Ph.D
Number of Speakers       8        18        16         7        3
Overall Word Rate      93.3%    92.7%     94.0%      91.7%    93.4%

7.2  Categories  of  Performance  Test  Errors 

Table 7-3 presents word recognition rates for six subgroups of the 100 word
vocabulary. These figures show that the CSR system has particular difficulty
identifying the function words, the digits and the decades. The function
words are frequently deemphasized in continuous speech, while the digits and
decades are frequently confused with each other. The word rate for the 66
word group, which makes up the majority of the word trials, is 98.4%. Of the
66 words in this group, 63 are multi-syllabic.

Table 7-3

Word Recognition Rate by Vocabulary Subgroups

Category                 Trials    Word Rate

"of, for, the"            3543       82.4%
"to, at, from"            1738       92.5%
"digits"                  2458       89.9%
"teens"                    441       94.2%
"decades"                  416       82.8%
"66 remaining words"      9820       98.4%

All Words                18416       93.1%

7.3 Performance Test With 82 Word Vocabulary

In order to gain insight into the performance of the CSR system on a less
challenging syntax, another performance test was designed and conducted with
a reduced vocabulary. The decade and teen nodes were eliminated from the
syntax, thus reducing the vocabulary to 82 words. The templates for each
speaker were the same as those used in the 100 word test. The test phrases
were a subset of those used in the 100 word experiments and were obtained by
eliminating all airline phrases containing a teen or decade word. This
reduced the average number of test utterances per speaker from 50 to 32.
Table 7-4 shows a comparison of performance on this subset of the airline
phrases. Results are given first with the 100 word vocabulary, including the
decades and teens, and then with the reduced 82 word vocabulary.

Table 7-4

Comparison of Performance of
100 Versus 82 Word Airline Vocabulary

50 Speakers
1650 Phrase Trials

                                          Excluding
                    All Words          "of, for, the"
               Median    Median      Median    Median     Digit
               Phrase     Word       Phrase     Word      Word
                Rate      Rate        Rate      Rate      Rate

100 Words      71.0%     94.6%       84.9%     96.9%     89.8%
82 Words       74.2%     95.8%       90.5%     98.0%     94.2%

When the vocabulary was reduced from 100 to 82 words, the median word rate
improved from 94.6% to 95.8%. Excluding "of", "for", and "the" errors, note
that a 98.0% word rate (or 2.0% error rate) was achieved on the 82 word
test, while the word error rate was 3.1% for the 100 word vocabulary.
Elimination of the decades and teens reduced word errors by 35.0%.

The corresponding figures for phrase rate errors (excluding "of", "for", and
"the") were 9.5% and 15.1%. Thus, the number of phrase errors declined 37%
with the reduced vocabulary. The 82 word experiment demonstrates the impact
that syntax and vocabulary selection can have on the performance of a CSR
system.

The right hand column of Table 7-4 shows the word recognition rates for the
ten digits. In the 82 word experiment, the digits could not be confused with
teen and decade words, and 43% of the digit word errors were eliminated.
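
The error reductions quoted in this section follow from the recognition rates
in Table 7-4 by taking the relative change in error rate (100% minus the
recognition rate). The sketch below makes that arithmetic explicit; the
function name is ours.

```python
def error_reduction(rate_before, rate_after):
    """Relative reduction in error rate, in percent, when the
    recognition rate rises from rate_before to rate_after."""
    err_before = 100.0 - rate_before
    err_after = 100.0 - rate_after
    return 100.0 * (err_before - err_after) / err_before


# Rates from Table 7-4, 100 word versus 82 word vocabulary.
print(round(error_reduction(96.9, 98.0)))  # word errors, excluding of/for/the
print(round(error_reduction(84.9, 90.5)))  # phrase errors, excluding of/for/the
print(round(error_reduction(89.8, 94.2)))  # digit word errors
```

These evaluate to about 35%, 37%, and 43%, matching the figures in the text.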


8. ERROR ANALYSIS


Error analysis in the context of this report is an attempt to classify the
causes of the recognition errors made by the CSR system in performance
tests. The data base for the error analysis is made up of those performance
test phrases which were incorrectly identified. Errors on the semantically
irrelevant function words "for", "of", and "the" were ignored in this study.
There were 425 of the 2500 performance test phrases (i.e., about 17%) that
contained word errors other than "for", "of", and "the". These 425 airline
query phrases constitute the data base for error analysis in this report.

8.1 Preliminary Analysis

The first step was to gather statistics on word and phrase trials,
insertions, deletions, and substitutions for individual speakers and for the
group of 50 speakers. Word accuracies were compiled for various subsets of
the 100 word vocabulary such as function words, digits, teens, and decades.
Many of these statistics were presented in the tables of Chapter 7. A more
detailed account of individual word errors in the vocabulary subgroups is
presented in Table 8-1. The function words aside, the most commonly
misrecognized words were the digit "eight", which varies according to the
presence of the stop release, and the decade "seventy", which is often
mistaken for the digit "seven".

A  list  of  all  phrases  in  error  was  compiled  and  for  these  phrases,  the  scores  and 
endings  of  individual  words  were  obtained.  Data  was  also  gathered  for  each 
speaker's  templates.  This  data  included  duration  of  the  averaged  template,  the 
number  of  isolated  and  extracted  tokens  which  made  up  the  averaged  cluster,  and 
whether  the  cluster  center  token  was  isolated  or  extracted. 
Table 8-1

Recognition Statistics
By Individual Word

Word              Occ          Ins          Del          Sub          Sub*         Rate

the               1621         70           [illegible]  14           0            79.6%
for               [illegible]  1            0            121          [illegible]  [illegible]
of                [illegible]  0            0            136          127          87.2%
TOTALS            [illegible]  71           [illegible]  271          206          82.4%

from              [illegible]  [illegible]  1            36           61           96.6%
to                887          2            2            70           24           91.9%
at                45           3            1            12           10           71.1%
TOTALS            [illegible]  [illegible]  4            117          [illegible]  [illegible]

zero              [illegible]  2            [illegible]  4            2            [illegible]
one               276          1            3            [illegible]  [illegible]  96.3%
two               321          [illegible]  6            13           19           94.1%
three             [illegible]  1            4            17           23           92.4%
four              [illegible]  2            0            26           12           90.2%
five              212          3            1            10           8            [illegible]
six               273          1            2            19           4            92.3%

report            369          0            2            7            2            97.6%
what-is           360          0            0            8            8            97.8%
tell-me           361          0            1            1            2            99.4%
give-me           303          0            0            1            7            99.7%
arrival-time      210          0            0            0            1            100.0%
departure-time    266          0            0            2            0            99.2%
location          249          0            0            0            1            100.0%
status            269          0            0            0            2            100.0%
flight-schedule   406          0            0            1            3            99.8%
aircraft-type     266          0            0            2            0            99.2%
passenger-load    271          [illegible]  0            0            0            100.0%
distance          160          0            0            0            0            100.0%
flight-time       169          0            0            2            1            98.8%
current           [illegible]  1            0            0            2            100.0%
weather           [illegible]  0            0            1            0            99.6%
forecast          143          0            0            0            0            100.0%
flight            1160         0            1            [illegible]  13           [illegible]
number            200          1            0            7            1            97.8%
aircraft          [illegible]  0            0            0            7            100.0%
hundred           86           2            2            0            1            97.7%
national          [illegible]  0            0            1            1            99.6%
american          187          0            0            [illegible]  [illegible]  99.4%

[Remaining rows illegible.]

8.2 Listening Procedures

Two  modes  of  listening  were  employed  in  error  analysis.  In  the  first  mode,  an 
audio  signal  was  synthesized  from  the  filter  coefficients  contained  in  test  utterance 
files.  This  mode  was  valuable  for  verifying  the  end  point  detection  process.  The 
second listening mode was to play back the recorded audio tape for individual
speakers  for  purposes  of  phonetic  comparison.  Listening  procedures  were  used 
only  for  those  phrases  containing  multiple  word  errors  or  suspected  end  point 
detection  problems. 

8.3 Error Classification and Coding

Table 8-2 contains a list of ten classes of CSR recognition errors including a
class for errors whose cause is unknown and a class for errors whose cause requires
further analysis. The ten classes often contain subgroupings which identify the
cause of the error more specifically.

The  phonetic  similarity  class  applies  to  those  words  which  were  mistakenly 
identified  as  a  similar  sounding  word(s).  The  end  point  detection  class  contains 
those  errors  in  which  the  CSR  system  did  not  properly  detect  the  beginning  or  end 
of  the  phrase.  Template  generation  classifies  those  errors  which  were  clearly  due 
to  an  inadequate  template  for  the  word.  Pronunciation  errors  are  a  self- 
explanatory  class. 

Error  classification  #4  is  labelled  "pause  in  unknown  for  multisyllable  word". 
The  CSR  algorithm  is  adept  at  handling  pauses  between  words,  but  may  make  an 
error  when  a  pause  occurs  within  a  word  such  as  in  "con-(pause)-tinental". 

The "gap between words" category applies to pauses between words which are
too short to be recognized by the silence template. This type of error occurs most
frequently in the "digit-decade-teen" portion of the airline syntax. The "1 to N"
category refers to recognition errors such as "seventeen" being identified as "seven
ten" while the "N to 1" category is the reverse.

Table 8-2

TABLE OF CODES FOR CAUSES OF CSR-SYSTEM ERRORS


0  CAUSE UNKNOWN                                        0 = 53

1  PHONETIC SIMILARITY                                  1 = 315
   a. Whole-word similarity (five/nine)
   b. Partial-word similarity (sixteen/six)
   c. Word boundary crossing (seventeen/seven ten)
   d. Coarticulation (two eight/three)

2  ENDPOINT DETECTION                                   2 = 8
   a. Beginning of sentence
   b. End of sentence

3  TEMPLATE GENERATION                                  3 = 145
   a. Isolated template too long; no extracted tokens
   b. No extracted tokens for some other reason
   c. Cluster center has extreme duration
   d. Variation in final-stop release
   e. Difference in stress patterns between isolated and extracted tokens
   f. Difference in intonation between isolated and extracted tokens
   g. Template is too short
   h. Template is too long

4  PAUSE IN UNKNOWN FOR MULTISYLLABLE WORD              4 = 9

5  GAP BETWEEN WORDS                                    5 = 56
   a. Digits, teens, or decades
   b. Other words

6  PRONUNCIATION ERROR                                  6 = 57
   a. Dropping amplitude at end of sentence
   b. Pause
   c. Stuttering
   d. Excessive reduced duration
   e. Excessive extended duration

7  1-TO-N OR N-TO-1 ERROR                               7 = 86
   a. n to 1
   b. 1 to 2
   c. 1 to 3
   etc.

8  PROPAGATION ERRORS                                   8 = 153
   a. Adjacent-word error and syntax constraint
   b. Nonadjacent-word error and syntax constraint
   c. Adjacent word has bad word boundary
   d. Illegal syntax

9  FURTHER ANALYSIS REQUIRED                            9 = 111

Total = 639

The classification "propagation error" contains those word errors caused by
the misrecognition of a previous word in the sentence. This type of error results
from the use of syntax to restrict the potential word candidates in various parts of
the sentence. Occasionally, if a key syntactic word is misidentified in the phrase,
the proper templates are not matched with the speech following this word.

To illustrate, consider the phrase "Tell-me the status of aircraft alpha bravo
thirty two". In the airline query grammar, the words in the "alpha" node must be
preceded by the word "aircraft". If "aircraft" is not identified, the words "alpha"
and "bravo" will also be misidentified. Errors such as "alpha" and "bravo" in the
above example would be classified as propagation errors while "aircraft" would be
placed in another error category.

8.4 Results of Error Analysis

Certain word errors cannot be ascribed to a single cause. For example, a word
may be identified as a similar sounding word, but may also have an inadequate
template or may cause a one-to-N or N-to-one type error. Thus, there is not a
one-to-one correspondence between word errors and error classifications.

The  data  base  of  425  phrases  with  errors  contained  639  individual  word  errors. 
A  summary  of  the  classification  of  these  errors  is  presented  in  Table  8-2.  The 
phonetic  similarity  class  contains  315  errors.  Thus,  for  nearly  half  of  the  words  in 
error,  the  CSR  system  recognizes  a  similar  sounding  word  or  words. 
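
As a rough check on these proportions, the class counts of Table 8-2 can be ranked and expressed as a share of the 639 word errors. This is a sketch only; the variable names are assumptions. Because one word error may carry several cause codes, the class counts add up to more than 639.

```python
# Counts per error class, copied from Table 8-2.
class_counts = {
    "cause unknown": 53,
    "phonetic similarity": 315,
    "endpoint detection": 8,
    "template generation": 145,
    "pause in multisyllable word": 9,
    "gap between words": 56,
    "pronunciation error": 57,
    "1-to-N or N-to-1 error": 86,
    "propagation errors": 153,
    "further analysis required": 111,
}
TOTAL_WORD_ERRORS = 639  # individual word errors in the 425 phrases

# Rank the classes from most to least common.
ranking = sorted(class_counts, key=class_counts.get, reverse=True)
for name in ranking:
    share = class_counts[name] / TOTAL_WORD_ERRORS
    print(f"{name:30s} {class_counts[name]:4d}  {share:6.1%}")
```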

The second most common error category is propagation errors. In these
cases, correction of the source of the propagation error would likely correct
multiple errors. The third most common category is template generation. Improved
training techniques may correct many of these word errors and errors in other
classifications as well.

The most challenging aspect of the 100 word airline query grammar is
recognizing the phonetically similar digit, teen, and decade words. A confusion matrix of
errors between these words is presented in Table 8-3. Propagation errors are not
included in this tabulation. The rows in Table 8-3 correspond to the intended word
while the columns represent the mistakenly recognized word. At the far right of the
table is a column containing the number of word trials, the number of times each
word appeared in a performance test phrase. The diagonal boxes in the table
denote the confusions of each word with its phonetically similar counterpart, for
example, "four" with "forty" or "thirty" with "three". The confusion of "seven" as
"seventy" occurred 23 times in the performance test while "eight" was confused
with "eighty" on 17 occasions. The clustering of the confusions on the diagonals
clearly demonstrates the difficulty of recognizing competing similarly sounding
words.
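
A confusion tabulation of this kind can be built by counting (intended, recognized) word pairs. The sketch below is a minimal illustration; the function name and the short trial list are hypothetical and are not data from the performance test.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (intended, recognized) pairs, skipping correct recognitions."""
    return Counter((spoken, heard) for spoken, heard in pairs if spoken != heard)


# Hypothetical trial log: (intended word, recognized word).
trials = [
    ("seven", "seventy"),
    ("seven", "seven"),
    ("eight", "eighty"),
    ("seven", "seventy"),
    ("thirty", "three"),
]

matrix = confusion_counts(trials)
for (spoken, heard), count in matrix.most_common():
    print(f"{spoken:8s} recognized as {heard:8s} {count} time(s)")
```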

Table 8-4 presents a tabulation of the number of words in error for each of the
2500 phrases of the performance test. 2085 phrases had no errors while 294 had
just one error. The second line of this table presents the data where propagation
errors are ignored. Propagation errors aside, in 85% of the cases where the CSR
system misrecognized a phrase, it misrecognized only one word in the phrase. This
indicates that CSR system word errors are somewhat independent with the
exception of propagation errors, of course.

Table 8-4

Number Of Word Errors Per Phrase

                        0       1       2       3       4       5 or more

WITH PROP. ERRORS       2085    294     [illegible; one surviving value: 15]

WITHOUT PROP. ERRORS    2085    346     [illegible; surviving values: 5, 0]

8. CONCLUSIONS


In performance of the Limited Connected Speech Experiment, ITTDCD has
demonstrated that a one hundred word vocabulary can be processed in real time
with high word recognition accuracy while concurrently providing a flexible
"voice-control" feature which allows the user to control the CSR system via spoken
commands of his own choosing.

Three  major  conclusions  from  this  contract  are  presented  below. 

1. The template averaging study discussed in Chapter 5 showed that a
speaker dependent CSR system with storage and processing
limitations achieves better performance with a set of averaged tokens than
with any individual set of tokens.

2. Performance of the CSR algorithm is extremely dependent on the
quality of word templates. In the initial development experiment
discussed in Chapter 6, a word rate of 65.7% and a phrase rate of 53.6%
were obtained with isolated single word templates. For the final
development experiment the corresponding figures for the same test
phrases and speakers were 86.7% and 65.2%. The recognition
algorithm was the same for both experiments but the training technique
was modified.

3. CSR performance on a given task will vary sharply according to the
structure of the task syntax and the choice of the task vocabulary. In
the 100 word versus 82 word experiment described in Section 7.3, a
median word rate of 94.6% was achieved for all words on the 100 word
test. For the same test phrases and speakers on the 82 word test, the
median word rate was 98.0% (excluding "of", "for", and "the" errors). A
reduction in word errors of 63% was thereby achieved by restructuring the
syntax, removing phonetically similar words from the vocabulary, and
ignoring semantically irrelevant errors.




APPENDIX  A 


CONNECTED  SPEECH  RECOGNITION  SYSTEM 
USER'S  GUIDE 


User  Interface 


That portion of the VAX-11/780 CSR system which is visible to the user is called
the user interface. Through this interface, the user invokes the csr program and
indicates to it the type and order of operations to be performed.

1.  Program  Invocation 

The VAX-11/780 CSR system is much like any other UNIX® applications program.
It can be executed any time when the user is at the shell level, although it is not
currently linked to either /bin or /usr/bin. This requires that an absolute pathname
be used to start up the program. That pathname currently is

/usr1/a/sr/bin/csr

The CSR system accepts optional switches at invocation which affect its
processing. The syntax of the program call is

csr [-ceistv] [extension [command_file ...] ]

(This assumes that a chdir(1) to the resident directory has been done previously.)
The switches include

-c  Continue processing following fatal errors. This is the default setting. Be
careful, however, since fatal errors can cause spurious results to occur.
This switch is mainly intended for program testing and short program
runs without command files. Long experiments should use the -t switch
(see below).

-e  Use the next program argument as an extension when forming both the
DPA analysis and experiment results pathnames. In both cases, the
extension (or as much of it as will fit) is appended onto the end of the
formed pathname. Keep in mind that UNIX pathnames are limited to 14
characters.

-i  Open a file with the name .csrrc in the current directory and perform its
commands as if they had been entered from the keyboard. This is useful
for performing repetitive initialization sequences. Normal processing
continues following end-of-file on the initialization file.

-s  Operate in silent, rather than verbose, mode. Normally, each CSR
command displays terse information concerning its execution. This
information can be suppressed by specifying this switch. Fatal errors and
program diagnostics cannot be suppressed, only informative messages.

-t  Terminate on first occurrence of a fatal error. This should always be used
for long experiment runs, since an error can result in spurious results
from that point on.

-v  Operate in verbose, rather than silent, mode. This is the default setting.
Normally, each CSR command displays terse information concerning its
execution. This information is written to the user's terminal or to the
experiment results file (if experiment mode is on). Fatal errors and
program diagnostics are always written to the standard output unit and
cannot be suppressed or redirected without use of the shell's redirection
facilities.

If command file(s) are specified on the invocation line, the program
automatically terminates when the end of the last command file is reached, provided no fatal
errors or other terminating conditions were encountered. Otherwise, the program
will prompt the user for valid commands and terminate upon entry of the quit
command.
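
The invocation syntax and switch meanings above can be summarized in a short parsing sketch. This is not the program's actual argument handling; the function name and dictionary keys are hypothetical, and it assumes that the extension argument is consumed only when -e is given and that every remaining argument is a command file.

```python
def parse_csr_args(argv):
    """Parse arguments laid out as: [-ceistv] [extension [command_file ...]]."""
    settings = {
        "terminate_on_fatal": False,  # -c (continue) is the stated default
        "extension": None,            # set only when -e is given
        "read_init_file": False,      # -i reads .csrrc first
        "verbose": True,              # -v is the stated default
        "command_files": [],
    }
    args = list(argv)
    wants_extension = False
    if args and args[0].startswith("-"):
        for flag in args.pop(0)[1:]:
            if flag == "c":
                settings["terminate_on_fatal"] = False
            elif flag == "t":
                settings["terminate_on_fatal"] = True
            elif flag == "e":
                wants_extension = True
            elif flag == "i":
                settings["read_init_file"] = True
            elif flag == "s":
                settings["verbose"] = False
            elif flag == "v":
                settings["verbose"] = True
            else:
                raise ValueError(f"unknown switch: -{flag}")
    if wants_extension:
        if not args:
            raise ValueError("-e requires an extension argument")
        settings["extension"] = args.pop(0)
    settings["command_files"] = args  # assumed: all remaining arguments
    return settings


# Hypothetical invocation: csr -te r1 digits.cmd
example = parse_csr_args(["-te", "r1", "digits.cmd"])
print(example)
```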

2. Program States

While the VAX-11/780 CSR system is running, it is in one of three distinct states
or modes:

• Command mode is the normal, default state of the CSR system. In this
mode, commands are read from either the keyboard or specified
command files. Each recognition trial must be explicitly performed by issuing
the recognizespeech command. Whenever an interrupt is caught, the
system returns to this state, regardless of what state it was in before the
interrupt.

• Ongoing Recognition mode is the program state in which recognition trials
occur one after another with no intervening user action. As soon as the
results of one recognition trial are displayed, the system is listening for
the next unknown utterance. This mode is entered from the Command
mode by setting the Single~recog switch off.

The user may exit to the Command mode by hitting an interrupt
(labeled DEL or RUBOUT on some terminals) or to the Voice Control mode by
saying the isolated word control. While in Ongoing Recognition mode,
there is no input read from the terminal or any open command file(s).

• Voice Control mode is made the current state if the user says the isolated
word control or types it in while in Command mode or if the isolated word
control is spoken while in Ongoing Recognition mode. A small subset of the
commands that the user has available in Command mode are available in
Voice Control mode. The main difference is that while in Voice Control
mode all commands are spoken, rather than typed in from the keyboard.
Typical commands are isolated words that require no arguments.

The user may exit to the Command mode by hitting an interrupt or
by saying offline. Similarly, saying recognize switches the system into
Ongoing Recognition mode. While in Voice Control mode, there is no input
read from the terminal or any open command file(s).

NOTICE: Although being in Voice Control mode requires ongoing
recognition, it is quite different from the Ongoing Recognition mode in
that only the control syntax/vocabulary is active. Ongoing Recognition
mode, on the other hand, generally has the non-control syntax/vocabulary
active (with the exception of the meta-control words cancel and control).
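
The three states and the transitions described above can be pictured as a small table-driven state machine. The sketch is illustrative only; the event labels (for example "single_recog_off") are hypothetical names for the actions in the text, not identifiers from the CSR program.

```python
COMMAND = "Command"
ONGOING = "Ongoing Recognition"
VOICE = "Voice Control"

# (current state, event) -> next state, following the descriptions above.
TRANSITIONS = {
    (COMMAND, "single_recog_off"): ONGOING,  # switch turned off by the user
    (COMMAND, "control"): VOICE,             # typed or spoken
    (ONGOING, "control"): VOICE,             # spoken
    (VOICE, "offline"): COMMAND,             # spoken
    (VOICE, "recognize"): ONGOING,           # spoken
}


def next_state(state, event):
    """Return the state that follows `event`; unknown events change nothing."""
    if event == "interrupt":
        return COMMAND  # an interrupt always returns to Command mode
    return TRANSITIONS.get((state, event), state)


state = COMMAND
for event in ["control", "recognize", "control", "offline"]:
    state = next_state(state, event)
    print(f"{event:10s} -> {state}")
```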

3.  Commands 

3.1.  Command  List 

The list of valid CSR commands includes commands which are allowed only from
the keyboard (preceded by the † symbol), allowed only from a command file
(preceded by the $ symbol) and those allowed both from the terminal and spoken in
Voice Control mode (preceded by the • symbol). Valid commands from the
keyboard are entered in response to the system's prompt (which is initially [illegible] but
can be changed under user control). All command names can be abbreviated to as
few characters as are necessary to guarantee uniqueness. For example, in the list
below sp, speak and speaker are all valid names for the same command. However,
specifying a command name of s is ambiguous because save-speech,
set-environment, signal and summary, as well as speaker, all begin with the letter s.


  analysis [off | on [analysis_results_pathname] ]
  average [averager_arguments]
  banner [one_line_message]
  change-directory [new_working_directory]
  clear-template word [directory]
  command-file command_file_pathname
• control
• display
† document [command_name ...]
  experiment [off | on [experiment_results_pathname] ]
  extract [fct] [extractor_arguments]
† help [command_name]
  live-experiment standard_phrase_pathname
  load-syntax syntax_specification_pathname
  load-templates
• offline
• options
  phrase-training [starting_phrase_number] [output_directory]
  pwd
  quit/^D (control-D)
• recognize [digital_unknown_pathname]
  reset-environment [variable_name ...]
  save-speech output_pathname
  set-environment [variable_name new_value] ...
  signal UNIX_signal_name [off | on]
  speaker initials
  summary [one_line_message]
  task task_name
  train-system [utterance_string [output_directory] ]
$ tty_input
  unix [C_shell_command]
  version
  vocabulary-training [beginning_word] [repetition_count]
• word-scores
  ! [C_shell_command]
  # [one_line_comment]
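
The unique-prefix abbreviation rule described at the start of this section can be sketched as follows. The code is an illustration, not the program's own lookup; the function name is hypothetical, only part of the command list is included, and letting an exact match win over longer names is an assumption.

```python
# Partial command list from Section 3.1, enough to show the rule.
COMMANDS = [
    "save-speech", "set-environment", "signal", "speaker",
    "summary", "task", "train-system", "version",
]


def resolve_command(abbrev, commands=COMMANDS):
    """Expand an abbreviation to the single command name it identifies."""
    if abbrev in commands:
        return abbrev  # assumed: an exact name is never ambiguous
    matches = [name for name in commands if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"unknown command: {abbrev}")
    raise ValueError(f"ambiguous command {abbrev!r}: {', '.join(matches)}")


for text in ["sp", "speak", "speaker"]:
    print(text, "->", resolve_command(text))

try:
    resolve_command("s")
except ValueError as error:
    print(error)
```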

3.2.  Command  Descriptions 

3.2.1. analysis [off | on [analysis_results_pathname] ]

This command changes the current state of analysis processing. When turned
on, detailed recognition information is written to the analysis output file. This
information includes DPA distances, directions of movement through the DPA
matrix and frame-by-frame recognition scores. In either the off or on setting,
analysis processing does not affect the recognition algorithm.

When used with no arguments, the current state is toggled (from off to on, or
vice-versa). If only one argument is present, it must be either the string off or on.
When turning analysis mode on in this case, the previously-used or default filename
is opened as the output file. The default pathname is of the form

/tmp/Allll.pppppp[ee]

where "llll" is the first four letters of the user's logon name, "pppppp" is the
zero-filled, six-digit process ID number (of the object program), and "ee" is an optional
extension that the user specified at program invocation by using the -e switch. If
the user specifies a pathname when turning the analysis mode on, that pathname
takes precedence over the default name.
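
The default pathname pattern can be illustrated with a short sketch. The function name and the sample logon name are hypothetical, and applying the 14-character limit to the final pathname component is an assumption; under that assumption the fixed part takes 12 characters, which leaves room for a two-character extension.

```python
MAX_NAME = 14  # character limit mentioned under the -e switch


def default_results_path(kind, login, pid, extension=""):
    """Build /tmp/<kind><llll>.<pppppp>[ee] as described above.

    kind      -- "A" for analysis results ("E" for experiment results)
    login     -- user's logon name; only the first four letters are used
    pid       -- process ID, zero-filled to six digits
    extension -- optional -e extension, cut to whatever still fits
    """
    name = f"{kind}{login[:4]}.{pid:06d}"
    room = max(0, MAX_NAME - len(name))
    return "/tmp/" + name + extension[:room]


print(default_results_path("A", "jsmith", 1234))
print(default_results_path("A", "jsmith", 1234, "run7"))
```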

3.2.2. average [averager_arguments]

The average command executes the template averager and, with no passed
arguments, produces averaged templates in the speaker's a.02 directory. The
averager only uses those files that have been created by training since the last
averager run. Any arguments following the command name are passed in to the
template averager. Since this command executes another program via
fork/exec, it could take a few minutes to finish.

3.2.3. banner [one_line_message]

The banner command allows the user to conspicuously display a single line of
text on the standard output unit. The message is preceded and followed by three
pound signs (#) and the message is optional. This command is useful for informing
the user of events, such as the completion of recognition trials for a speaker.

3.2.4. change_directory [new_working_directory]

This command allows the user to change the program's concept of its current
working directory. This is useful since speech files reside in many different
directories. The full power of the shell's chdir(1) command is supported. If no argument
is specified, the program returns to the directory which the user was in when the
CSR system was invoked. If an argument is present, it is a pathname of a directory to change to.
The specified pathname, and those higher in the hierarchy, must be searchable by
the user and must be a directory. This command can also be abbreviated to cd or
chdir. One note of caution: just as in the shell version of the command, a directory
changed to is the current working directory until another instance of the command.
Since many csr commands take pathnames as arguments, be extremely careful
when using non-rooted pathnames, since they will be affected by the current
working directory and its location in the file hierarchy.

3.2.5. clear_template word [directory]

The clear_template command allows the user to discard all or some of the
repetitions of a trained vocabulary word. The first argument is always required and
is the vocabulary word whose template(s) need to be retrained. The second
argument is an optional directory from which to clear the word. If the directory is not
specified, the word is cleared from all directories owned by the user.

Recall that when in training, template files are created read-only. This
effectively prohibits accidental destruction through retraining. If a retraining of a
word is desired, the word's template(s) must be cleared first and then retrained.
Since clear_template accomplishes its task by merely changing the access mode of
the template file(s) to read/write by all, a cleared template is not destroyed until a
retraining is done.
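
The read-only protection and the clearing step can be pictured with ordinary file permission calls. This is a sketch under stated assumptions, not the CSR implementation: the function names are hypothetical, a temporary file stands in for a template file, and the exact permission bits used for the read-only state are assumed.

```python
import os
import stat
import tempfile


def protect_template(path):
    """Mark a freshly trained template read-only (assumed mode: r--r--r--)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def clear_template_file(path):
    """Make the template read/write by all so retraining may overwrite it."""
    os.chmod(path, 0o666)


# A temporary file stands in for a template file.
handle, template_path = tempfile.mkstemp()
os.close(handle)

protect_template(template_path)
protected_mode = stat.S_IMODE(os.stat(template_path).st_mode)

clear_template_file(template_path)
cleared_mode = stat.S_IMODE(os.stat(template_path).st_mode)
still_exists = os.path.exists(template_path)  # clearing does not delete

os.remove(template_path)
print(oct(protected_mode), oct(cleared_mode), still_exists)
```

The file's contents are untouched by clearing; only a later retraining would replace them.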

3.2.6. command_file command_file_pathname

This command is extremely useful for long or repetitive sequences of CSR
commands. Commands are entered into a file with the assistance of one of the many
available text editors. Then, each time this command is executed the commands in
the specified file are performed as if they were entered by the user from the
keyboard (with the exception of those keyboard-only commands). Command files may
be nested to a depth of 9, which should be more than adequate for most
applications. This number can be raised to a maximum of 15, if absolutely necessary, by
changing the value of the defined constant CFM_NEST to 15.
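
Nested command files with a depth limit can be sketched as a simple recursive reader. This is illustrative only: the function name is hypothetical, command files are modelled as an in-memory dictionary of lines rather than real files, and the spelling of the constant is assumed.

```python
CFM_NEST = 9  # nesting limit quoted above; the constant's spelling is assumed


def run_command_file(name, files, execute, depth=1):
    """Run the lines of command file `name`, following nested command files.

    `files` maps a pathname to its list of lines (a stand-in for real files);
    `execute` is called with every other command line.
    """
    if depth > CFM_NEST:
        raise RecursionError(f"command files nested deeper than {CFM_NEST}")
    for line in files[name]:
        words = line.split()
        if not words or words[0] == "#":
            continue  # blank line or one-line comment
        if words[0] == "command-file":
            run_command_file(words[1], files, execute, depth + 1)
        else:
            execute(line)


# Hypothetical command files.
files = {
    "outer": ["speaker abc", "command-file inner", "quit"],
    "inner": ["# set-up for one speaker", "banner done"],
    "loop": ["command-file loop"],
}
performed = []
run_command_file("outer", files, performed.append)
print(performed)
```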

3.2.7. control

The control command places the CSR system into Voice Control mode (see §2).
Once in Voice Control mode, command input from the terminal and any open
command files is suspended. There is a small subset of commands that are operative
during Voice Control mode. These include:

•  display 

•  offline 

•  options 

•  recognize 

•  word-scores 

This command can be entered from the terminal keyboard or spoken during
a recognition trial. The user can exit from Voice Control mode by saying
offline or by hitting an interrupt. In either case, the program returns to
Command mode.

3.2.8. display

This command displays the current values of the CSR system's settable
engineering parameters. It is equivalent to typing set with no arguments. It is valid
both as a keyboard entry and as a spoken command while in Voice Control mode.

3.2.9. document [command_name ...]

This command, when entered from the keyboard only, allows the user to peruse
various external documentation files concerning valid CSR commands. If no
command names are specified, the entire list of command documentation is shown to
the user, in alphabetical order, one page at a time. If command name(s) are
specified, only those commands are documented. In either case, the program
fork/executes a UNIX shell to page through the output using the more(1) command.
The external documentation is located in /usr1/a/sr/csr/csr.doc.

3.2.10. experiment [off | on [experiment_results_pathname] ]

The experiment command changes the current state of experiment processing.
When turned on, recognition results, as well as verbose program output, are written
to the experiment results file instead of to the standard output unit.

When used with no arguments, the current state is toggled (from off to on, or
vice-versa). If only one argument is present, it must be either the string off or on.
When turning experiment processing on in this case, the previously used or default
filename is opened as the output file. The default pathname is of the form

/tmp/Ellll.pppppp[ee]

where "llll" is the first four letters of the user's logon name, "pppppp" is the
zero-filled, six-digit process ID number (of the object program), and "ee" is an optional
extension that the user specified at program invocation by using the -e switch. If
the user specifies a pathname when turning the experiment processing on, that
pathname takes precedence over the default name.

3.2.11. help [command_name]

This command, when entered from the keyboard only, allows the user to view
one-line summaries of valid CSR commands. If no command names are specified, the
entire list of command summaries is shown to the user, in alphabetical order, one
page at a time. If command name(s) are specified, only those commands are
summarized. For more detailed documentation, use the document command.
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3.2.12.  linear-transform  [off  \  on  [Line ar_transfarmation_pa.thn.ame]  ] 

The linear_transform command changes the current state of linear transformation
processing. When turned on, all subsequent speech input is transformed by the
linear transformation matrix that was last specified.

When used with no arguments, the current state is toggled (from off to on, or
vice-versa). If only one argument is present, it must be either the string off or on.
When turning linear transformation processing on in this case, the previously used
filename is used as the transformation. The default transformation pathname is
undefined (null); thus, the first time, the user must either use the set_environment
command to set the Transform environment value to the appropriate transformation
pathname or specify the linear transformation pathname explicitly.

See §1.2 in Data Descriptions for more details.
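
The argument handling described above can be sketched as follows. This is a minimal illustration, not the program's code: the function name, the return convention and the error messages are hypothetical.

```python
from typing import List, Optional, Tuple


def toggle_linear_transform(
    is_on: bool, last_path: Optional[str], args: List[str]
) -> Tuple[bool, Optional[str]]:
    """Return the new (state, pathname) after a linear-transform command."""
    if not args:
        # No arguments: toggle the current state, keep the previous pathname.
        new_state, path = (not is_on), last_path
    elif args[0] == "off" and len(args) == 1:
        new_state, path = False, last_path
    elif args[0] == "on" and len(args) <= 2:
        new_state = True
        # An explicit pathname takes the place of the previously used one.
        path = args[1] if len(args) == 2 else last_path
    else:
        raise ValueError("usage: [off | on [pathname]]")
    if new_state and path is None:
        # The default pathname is undefined, so one must have been supplied.
        raise ValueError("no transformation pathname has been specified")
    return new_state, path
```

Turning the transformation on before any pathname is known raises an error in this sketch, which mirrors the requirement that the pathname be supplied the first time.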

3.2.13. load_syntax syntax_specification_pathname

The load_syntax command sets up the syntax data structures that are necessary
to drive the recognition algorithm. The DPA process is syntax-directed and will
not work without some kind of syntax present. Currently, for simple grammars that
operate without syntax (for example continuous digits) there is a dummy syntax
available in /speech/csr/syx/anyd.syx that permits any combination and order of
vocabulary words. More complex grammars, not surprisingly, have more complex
syntax specifications.

3.2.14.  quit/^D  (control-D) 

The  quit  command  is  the  only  way  to  gracefully  exit  the  program  explicitly.  If 
experiment  processing  is  on,  it  is  turned  off  and  results  are  summarized  prior  to 
program  exit.  An  alternate  and  equivalent  way  to  quit  the  program  is  to  enter  a 
control-D. 

3.2.15. read_template speech_file_pathname [template_#] [nosyntax]

This command allows the user to load template memory and to indicate the
template number to be assigned and whether the template is constrained by the
syntax. It and downline_load are the only ways to load template memory.

When only a single argument is present, it is the name of a UNIX file that
contains speech written as described in §1.1 of Data Descriptions. The source of the
data can be either PDP-11/60 or VAX-11/780, which is indicated by the Source
environment value. The file is opened, the data is read into buffers and the buffers
are pre-processed before being stored into template memory. This pre-processing
is necessary due to the differences in how network information is stored. In the
single argument case, the first available template number (starting at 1) is assigned.
If a second argument is present, it is assumed to be the template number or the
string "nosyntax" (or "nosyn" for short). In the former case, the specified number
is assigned to the template and any existing template with that number is
overwritten. If the "nosyn" string is present as the final argument, it indicates that this
template is not constrained by syntax, i.e., the template can match any unknown
utterance at any point in the partial phrase. A typical use for the meta-syntax
indication is for the silence template. It is not part of the grammar, vocabulary or
syntax, but it should be considered a candidate for matching within the DPA process at
any time.

Although template numbers can be specified when reading speech templates
into memory, doing so is advised against because of the restrictions that the
syntax imposes. Templates must be loaded in the same order that the syntax
specification is laid out. The DPA process expects template numbers to begin with 1
and increment with no "holes", or missing templates. If a template number is
expected to be filled with a template and it is empty, the DPA process will fail to
work properly.
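
The numbering rules can be illustrated with a short sketch. It assumes template memory is represented simply as a collection of the template numbers in use; both helper names are hypothetical.

```python
from typing import Iterable, List


def first_available(template_numbers: Iterable[int]) -> int:
    """Return the lowest unused template number, counting from 1."""
    used = set(template_numbers)
    number = 1
    while number in used:
        number += 1
    return number


def find_holes(template_numbers: Iterable[int]) -> List[int]:
    """Return missing numbers between 1 and the highest number in use."""
    used = set(template_numbers)
    if not used:
        return []
    return [n for n in range(1, max(used) + 1) if n not in used]


print(first_available({1, 2, 4}))  # next number assigned in the single argument case
print(find_holes({1, 2, 4}))       # holes that would upset the DPA process
```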

3.2.16. recognize_speech speech_file_pathname

This command performs a recognition trial between the speech contained in
the specified file (the unknown) and all templates that are in template memory.
There must be at least one template in memory for the DPA process to be initiated.

Each frame of the unknown is compared against all frames of all templates in
memory that syntax considers acceptable. Syntax may reject whole templates, or
portions of templates, due to syntactic restrictions. For a given unknown/template
pair, a DPA matrix can be output if both analysis mode is on and the Analy_tem
environment value is set to the desired template number. This matrix contains
detailed information that was used by the DPA process to determine the best
matching template. This includes inter-frame distances and indicates the path the
DPA process chose as the best one through the matrix (see §1.7 of Data Descriptions
for an example).

After  all  templates  have  been  compared  to  each  frame  of  the  unknown,  the 
unknown  phrase  is  printed  along  with  the  best  matching  choices  from  among  the 
templates. In continuous speech applications, the unknown is typically a multi-word
phrase and the best options are combinations of vocabulary words that proved
closest  to  the  unknown.  The  best  matching  choice  is  presented  first,  along  with  its 
score.  Following  this  are  the  next  best  choices  and  scores,  one  choice  to  each  line, 
with  the  better  choices  appearing  first. 

3.2.17. reset_environment [variable_name ...]

The reset_environment command sets environment values back to their initial
(default) values. All variables can be reset or just those specified. In either case,
those variables that are reset are displayed with their new, initial values.

This command is useful for duplicating the environment for respective
experiments in a single run. NOTE: Variables are reset to the values that they have when
the program is executed, not necessarily to the values that they had when the first
recognition trial was made.

3.2.18. set_environment [variable_name new_value] ...

This command modifies selected environment values or displays the total
environment. The number of arguments determines its processing. If no arguments
are present, the CSR environment is displayed with no changes. If arguments
are present, they should be in pairs, each pair specifying a valid environment
variable name and a value to set it to. Variable names can be abbreviated to as few
characters as are necessary to guarantee uniqueness. It is considered to be a fatal
error to reference an unknown or ambiguous variable or to set a variable to a value
out of range or of the wrong type.

3.2.19. signal UNIX_signal_name [off | on]

The signal command causes specified UNIX signals to be ignored or to terminate
the program (default setting) upon their receipt. It is a very simple interface to the
signal(2) function that the UNIX kernel makes available. If a given signal is being
ignored and it is received by the program, the signal name will be displayed to the
user to indicate that the signal was received. This notification can be important
since some signals cause system calls (such as reads and writes) to fail, regardless
of their effect on the receiving program. For example, the user can specify that
interrupts be ignored. However, if one is received when a read is outstanding, the
read fails and returns an error status.

Although any or all signals can be ignored, it is best to ignore only those signals
that are user-generated, namely hangup, interrupt and quit. It serves little
purpose to ignore a bus error, segmentation violation and others since they indicate
very serious program flaws that will cause program termination sooner or later.
This command is best used carefully.
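
The behaviour of the signal command can be approximated with the standard Python signal module, as sketched below. The function names are hypothetical, and the sketch makes no claim about how the off and on keywords map onto the two behaviours. Because the manual says the name of an ignored signal is displayed on receipt, a reporting handler is installed in place of a silent ignore.

```python
import signal


def report_signal(signum, frame):
    """Handler for an 'ignored' signal: display its name and carry on."""
    print(f"received {signal.Signals(signum).name} (ignored)")


def set_signal_ignored(signal_name: str, ignore: bool) -> None:
    """Ignore a signal (with notification) or restore the default action."""
    signum = getattr(signal, signal_name)  # for example "SIGINT" or "SIGHUP"
    # SIG_DFL restores the default action for the signal.
    signal.signal(signum, report_signal if ignore else signal.SIG_DFL)
```

A call such as set_signal_ignored("SIGINT", True) corresponds to asking that interrupts be ignored. Handlers can only be installed from the main thread of a Python program.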

3.2.20. summary [one_line_comment]

This summary command causes the current experiment counters to be written
out to the experiment results file, following an optional single-line message. If
experiment processing is turned off, this command does nothing. Counters written
out include the number of phrase (recognition) trials, number of correct options
and errors (accompanied by percentages), number of word (sub-recognition)
results, correct matches and errors and the number of times the silence template
matched. Although the output looks identical to that produced when experiment
mode is turned off or when the program terminates normally with experiment
processing on, there is one big difference: summary does not reset the counters back
to zero. The main advantage to using this command is that if multiple speakers are
being recognized in an experiment, following each speaker the accumulated totals
to date can be written out. Notice that since the counters aren't reset, all counts
are accumulated from the last time the experiment processing was turned on.

3.2.21.  tty_input 

The tty_input command, when encountered in a command file only, causes the
system to audibly prompt the user for keyboard input and continue accepting input
from the keyboard until the user enters a control-D or a quit command. The
system also uses a different prompt (*) to indicate to the user that this keyboard
input has been requested by a command file. This command is useful when the user
wishes to check intermediate results during long runs. After the user terminates
the keyboard input, processing of the command file continues with the next
command. Entering this command from the keyboard elicits a warning message if the
verbose output mode is on.

3.2.22. unix [C_shell_command]

This command causes a temporary UNIX C shell to be run, suspending
the CSR system for the duration. If no arguments are present, a shell is created
and run until the user terminates it with a control-D. Normal shell aliases, search
paths and variables are set to those that the user would have when logging in to
UNIX. If argument(s) are present, the first one is assumed to be the name of a
program to run at the shell level. Second and subsequent arguments are inputs to the
program. This version of the shell is historically called a mini-shell since it
executes the specified program and immediately exits back to the CSR system. Notice
that when arguments are specified to the unix command, the user needn't (and
shouldn't) enter a control-D to terminate the shell.

3.2.23. upline_load formatted_template_pathname [template_# ...]

The upline_load command allows the user to write templates out of template
memory to an external file. The main difference between this command and the
write_template command is the structure of the output files. This command
creates a file that can be read by the downline_load command. The file is formatted
exactly as template memory is, so there is no need for pre-processing of the speech
data as there is with the write_template command.

When  a  single  argument  is  present,  it  is  the  pathname  of  a  file  to  create.  If 
multiple arguments are present, all arguments following the pathname are
template numbers to write out. If no numbers are present, all templates are written
out  in  numeric  order. 

3.2.24.  version 

The version command causes several lines of information concerning the CSR
system  to  be  displayed  on  the  standard  output  unit.  The  information  includes  the 
program's  name,  version  number,  resident  directory  (of  the  object)  and  the  last 
date  it  was  compiled. 

3.2.25. write_template speech_file_pathname [template_# ...]

This  command  writes  templates  out  of  template  memory  to  an  external  file.  All 
of  template  memory  can  be  written  out  or  selected  templates  can  be  output 
depending  upon  whether  template  numbers  are  present  or  not.  The  output  file  has 
the same format as that described in §1.1 of Data Descriptions. Although all of
template memory can be written out, it is advised that each template be written to a
separate  file.  The  only  files  that  are  currently  supported  that  contain  multiple 
speech  files  are  PDP-11/60  archive  files  (see  archive  command  description).  Writing 
the  template  out  does  not  affect  its  representation  in  template  memory.  Make  sure 
that  the  current  working  directory  of  the  program  allows  writing  or  specify  a  rooted 
pathname  as  the  output  file  name. 

4.  Environment  Variables 

4.1.  Variable  List 

The CSR environment variables control the operation of the system. Most of the
environment variables pertain to the recognition process and don't affect any other
parts of the system. The environment variables are listed below, each with a short
description, its type (as declared in C), its default value and the range of values that
it can be set to. When specifying variable names, only enough characters to
guarantee uniqueness are required. Thus, L, Lcm and Lcmdim are all names for the same
variable. However, Amp is ambiguous since both Amp_normalize and Amp_threshold
begin with the string "Amp".
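
Abbreviation matching of this kind can be sketched as a unique-prefix lookup. The sketch below is illustrative only: the function name is hypothetical, the list holds just the variables described in section 4.2 of this document, and two points are assumptions, namely that matching is case-sensitive and that an exact name (such as Analy, which is also a prefix of Analy_tem) always wins.

```python
from typing import List

# Only the variables described in section 4.2; the real list is longer.
VARIABLE_NAMES = [
    "Amp_normalize", "Amp_threshold", "Analy", "Analy_tem",
    "Beam_factor", "Beam_threshold", "Beginpen",
]


def resolve_variable(abbreviation: str, names: List[str]) -> str:
    """Expand an abbreviation to the one variable name it identifies."""
    if abbreviation in names:
        return abbreviation  # assumption: an exact match is never ambiguous
    matches = [name for name in names if name.startswith(abbreviation)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"unknown variable: {abbreviation}")
    raise KeyError(f"ambiguous variable: {abbreviation} matches {matches}")


print(resolve_variable("Beg", VARIABLE_NAMES))
print(resolve_variable("Amp_n", VARIABLE_NAMES))
```

With this list, "Amp" raises an error because two names share that prefix, in line with the example given above.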

Figure X. Summary of CSR Environment Variables

4.2.  Environment  Variable  Descriptions 

4.2.1. Amp_normalize -- Amplitude normalization flag

The amplitude normalization flag controls whether normalization of amplitude
is performed on all subsequent inputs of speech data, both templates and
unknowns. Amplitude normalization involves replacing the input amplitude with an
average amplitude obtained by summing the frame parameters and dividing by the
number of parameters. Rounding is used when the number of parameters is not
even.
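
A minimal sketch of the averaging step follows. It assumes the frame parameters are non-negative integers and that rounding means rounding to the nearest integer with halves rounded up; the manual does not state the rounding rule, and the function name is hypothetical.

```python
from typing import Sequence


def average_amplitude(frame_parameters: Sequence[int]) -> int:
    """Average the frame parameters, rounding to the nearest integer."""
    count = len(frame_parameters)
    total = sum(frame_parameters)
    # Adding half the divisor before integer division rounds halves upward.
    return (total + count // 2) // count


print(average_amplitude([1, 2, 3, 4]))  # 10 / 4 = 2.5
print(average_amplitude([2, 2, 2]))     # 6 / 3 = 2
```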


4.2.2. Amp_threshold -- Threshold of speech

The amplitude threshold is a value that delimits speech from "silence". It is
only used when endpoint detection is being done, usually when live speech is being
processed (which is seldom or never). Each input frame's amplitude is compared
with the threshold value and if it is less than or equal to the threshold, the frame is
assumed to be silence. Conversely, if the amplitude value is greater than the
threshold, the frame is assumed to be speech.

4.2.3. Analy -- Analysis processing mode indicator

This toggle value is read-only and cannot be set using the set_environment
command. Rather, it indicates in an off/on manner the current state of analysis
processing. Use the analysis command to change the state of analysis mode
processing.

4.2.4. Analy_tem -- Analysis template number

This value, when set to a template number with analysis processing on, causes
the unknown/template DPA matrix and scoring information to be written out to the
DPA analysis file. When set to -1 or 0, or when analysis processing is off, it does not
affect the program or its results.

4.2.5. Beam_factor -- Beam search multiplier

The Beam_factor defines a window on the partial phrase scores which
eliminates some scores from post-processing. After each unknown frame is processed,
the Beam_factor is multiplied with the best (lowest) score. The resulting product is
stored in the Beam_threshold environment variable and defines the largest score to
process during scoring. Typically, the window defined by the product eliminates
70% of the scores as being too large. The scoring algorithms are very sensitive to
this value and even small changes can affect results.

4.2.6. Beam_threshold -- Beam search threshold

This value gives an upper limit on the recognition scores that will be processed
following each pass of the unknown frame. It is produced by multiplying the best
(lowest) score on a given pass with the Beam_factor environment value. It can be
set before each recognition trial; however, it is reset after processing each unknown
frame.
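
The relationship between the two beam variables can be sketched as follows. The function name, the data layout (a dictionary of partial phrase scores) and the example numbers are hypothetical; only the rule itself, threshold = factor x best (lowest) score with larger scores dropped, comes from the descriptions above.

```python
from typing import Dict, Tuple


def beam_prune(
    scores: Dict[str, float], beam_factor: float
) -> Tuple[float, Dict[str, float]]:
    """Return the beam threshold and the scores that fall inside the beam."""
    best = min(scores.values())          # lower scores are better
    beam_threshold = beam_factor * best  # largest score still processed
    kept = {key: s for key, s in scores.items() if s <= beam_threshold}
    return beam_threshold, kept


threshold, kept = beam_prune({"a": 100, "b": 120, "c": 200}, 1.5)
print(threshold, sorted(kept))
```

Because the threshold is recomputed from the best score after every unknown frame, a value set by hand only affects the start of a trial.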

4.2.7. Beginpen -- Starting path penalty

This value gives a penalty increment to apply to templates that start paths for
each unknown frame. The DPA process and syntax specification may allow
templates to begin a path anywhere within the unknown utterance. However, as one
proceeds along the unknown, the penalty assigned to templates that begin
increases until a path through the DPA matrix ends. This prevents the first few
frames from being discarded just because they don't match any of the templates
very well. Typically, this value is set to 80, which defines the starting path penalty to
be 1 (on unknown frame 0), 81 (frame 1), 161 (frame 2) and so on until a path is
completed, at which time the penalty is set to the ending template's score and
incremented each unknown frame by Beginpen.
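
The penalty sequence quoted above (1, 81, 161) follows a simple linear rule, sketched here for the frames before any path has been completed. The function name is hypothetical, and the behaviour after a path completes is not modelled.

```python
def starting_penalty(frame_index: int, beginpen: int = 80) -> int:
    """Penalty for starting a path on the given unknown frame."""
    return 1 + beginpen * frame_index


print([starting_penalty(frame) for frame in range(3)])
```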

APPENDIX B


A CONTINUOUS SPEECH DATA BASE*


8.  P.  Lindtll,  A.  t.  Salth,  H.  M.  Noble,  and  N.  L.  Alcove 


ITT Defense Communications Division
10060 Carroll Canyon Road
San Diego, California 92131


1. INTRODUCTION

The development of continuous speech
recognition algorithms requires extensive
testing on large data bases from a wide
sampling of the speaker population to obtain
statistically significant performance data.
An attempt to design and record such a data
base has been made by the ITT Defense
Communications Division. The data base has several
useful components including phrase sets
generated from progressively larger finite state
grammars, a connected digit component, an
alphabet spelling component, and a diagnostic
rhyme component. This paper describes the
content of the data base and the recording
facilities and procedures used to collect the
speech data.


2.  DATA  BASE  DESCRIPTION 

The continuous speech recognition data
base designed by the ITT Defense
Communications Division is summarized in Table 1.
Speech has been recorded from a speaker
population consisting of 25 males and 25 females.
The spoken material involves training and
test utterances from seven data base
components which will now be described in
detail.


2.1 Airline Sets

The components denoted as Airline Sets
1-4 in Table 1 are interrelated. This
portion of the data base has been designed to be

Table 1

DATA BASE SUMMARY

Data Base             Number of             Vocabulary    Number of      Continuous            Number of
                      Speakers                            Vocabulary     Phrases               Nodes in
                                                          Repetitions                          Grammar

Airline Set 1         25 male, 25 female    53 words      3              50 phrases            21
Airline Set 2         25 male, 25 female    100 words     3              50 phrases +          90
                                                                         66 phrases for
                                                                         training
Airline Set 3         11 male, 11 female    211 words     3              50 phrases            92
Airline Set 4         11 male, 11 female    301 words     3              50 phrases            128
Digit Strings         10 male, 10 female    10 digits     3              150 digit strings
Alphabet/             10 male, 10 female    26 letters    5              50 word spellings
Word Spelling
Diagnostic Rhyme
 -Initial Conson.     2 male, 2 female      125 words     7
 -Final Conson.       2 male, 2 female      125 words     7

* This work was performed under RADC contract number F30602-81-C-0155.

Figure 1. Finite State Grammar For Airline Set 1 (53 word vocabulary, 21 nodes)


representative of limited syntax application
areas for speech recognition. The phrases in
the Airline Sets are representative of a
simplified air travel information retrieval
application.

The sentences in Airline Set 1 were
generated by the finite state grammar depicted
in Figure 1. The grammar contains 21 nodes
(including the start and end node) and a 53
word vocabulary. Grammars for Airline Sets
2-4 were then designed by progressively
expanding this core finite state grammar to
include additional nodes and node connections
as well as additional words in the
vocabulary. Thus, for example, any phrase generated
from the Airline Set 2 grammar can also be
generated from the grammar of Airline Sets 3
and 4. However, such a phrase cannot
necessarily be generated by the grammar of Airline
Set 1. Table 2 gives examples of the types
of sentences that are added by Airline Sets
2, 3, and 4.

Table 2

EXAMPLES OF PHRASES FROM AIRLINE SETS 2-4

Airline Set 2

[illegible] the flight schedule of aircraft charlie yankee
four [illegible] from Boston

Tell me the departure time of Western flight number
fifteen.

Airline Set 3

I want to return from New Orleans to Phoenix on
Monday the thirteenth of January

Flight schedule from Hartford Connecticut to Portland
Oregon on Friday, April twenty fifth

Airline Set 4

I am staying at the Royal Inn hotel

I would like to get some information about my
reservation

[illegible] phone number is [illegible] 6665
I will pay with my American Express card
I would like to report a green coat lost on Allegheny
two twelve


Table 3 shows the growth in complexity
of the airline grammars by several different
measures. The fourth column shows that the
maximum sentence length ranges from 11 words
to 22 words, while the average sentence
length increases from 9.6 words to 18.6.
Similarly, in column five, the total number
of phrases generated by each grammar ranges
by [illegible] orders of magnitude from 10^[illegible]
phrases to 10^[illegible] phrases. The last column gives an
information theoretic measure of complexity.
The perplexity* (entropy = log2 perplexity)
measures the average number of word choices
at each word position of the language. Thus,
for example, a recognition system using
Airline Grammar 1 would have to choose between
approximately 5.7 words on the average as it
identified each new word in the sentence.
Without the grammar, but with the same
vocabulary, all 53 words would compete at each
word position.
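
The relation entropy = log2 perplexity can be checked numerically. The sketch below takes the Airline Grammar 1 perplexity as roughly 5.7 and compares it with the no-grammar case, in which all 53 words compete at each position; the function name is hypothetical.

```python
import math


def entropy_bits(perplexity: float) -> float:
    """Entropy in bits per word for a given perplexity."""
    return math.log2(perplexity)


# With the grammar: about 5.7 word choices per position.
print(round(entropy_bits(5.7), 2))  # 2.51
# Without the grammar: all 53 vocabulary words compete.
print(round(entropy_bits(53), 2))   # 5.73
```

On these figures the grammar removes a little over three bits of uncertainty per word position.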


* Jelinek, F., Mercer, R. L., Bahl, L. R.,
and Baker, J. K., "Perplexity - A Measure
of Difficulty of Speech Recognition
Tasks", The Journal of the Acoustical
Society of America, Vol. 62, S1, 1977,
abstract.


According to the perplexity, and to the
degree of precision given in the table,
Airline Grammar 4 has the same complexity as
Airline Grammar 3 even though Set 4 has a
larger vocabulary. This is true because the
new words of Set 4 are added in new finite
state paths which do not significantly
increase the total number of phrases (i.e.,
about 10^[illegible] phrases were added to the
10^[illegible] phrases in Set 3).

As shown in Table 1, all 50 speakers
recorded training and test utterances for
Airline Sets 1-2. Training data was
collected for each speaker by recording three
isolated word repetitions of the 100 word
vocabulary associated with Airline Set 2
which included the 53 word subset associated
with Airline Set 1. The order of
presentation for these words varied with each
repetition. Additional training material was
obtained by recording 66 phrases generated by
the finite state grammar defining Airline Set
2. This set of phrases contains at least two
occurrences of each word in the 100 word
vocabulary which may be used for extracting
word templates.

A total of 22 speakers recorded training
and test utterances for Airline Sets 3-4.
Training data was collected for each speaker
by recording three isolated word repetitions
of only the new vocabulary words added by
Airline Set 3 (111 words) and Airline Set 4
(90 words). The order of presentation for
these additional words varied with each
repetition.

The test material for each Airline Set
was obtained by recording 50 phrases per
speaker. To obtain both a large variety of
phrases and common test data across different
speakers, the speakers were divided into
groups. Each group recorded a different
collection of 50 phrases. No more than seven
speakers were assigned to the same group.
Phrases for each Airline Set were generated
randomly from the associated finite state
grammar with the following constraints. No
phrase was generated twice within a group of
50 phrases (and rarely across any two
groups). Also, any phrase generated for
Airline Sets 2-4 contained at least one word
that was not a member of the vocabulary of
the smaller Airline Sets.
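
Random phrase generation from a finite state grammar, with no phrase repeated within a group, can be sketched as follows. The toy grammar, its node names and its words are invented for illustration and are not the airline grammars of this data base; the function names are also hypothetical.

```python
import random
from typing import List

# Hypothetical toy grammar: node -> list of (word, next node) arcs.
GRAMMAR = {
    "start": [("show", "what"), ("list", "what")],
    "what": [("flights", "from"), ("fares", "from")],
    "from": [("from", "city1")],
    "city1": [("boston", "to"), ("denver", "to")],
    "to": [("to", "city2")],
    "city2": [("chicago", "end"), ("dallas", "end")],
}


def count_phrases(node: str = "start") -> int:
    """Total number of phrases the (acyclic) grammar can generate."""
    if node == "end":
        return 1
    return sum(count_phrases(next_node) for _, next_node in GRAMMAR[node])


def generate_phrase(rng: random.Random) -> str:
    """Random walk from the start node to the end node."""
    node, words = "start", []
    while node != "end":
        word, node = rng.choice(GRAMMAR[node])
        words.append(word)
    return " ".join(words)


def generate_group(size: int, rng: random.Random) -> List[str]:
    """Generate a group of phrases in which no phrase occurs twice."""
    if size > count_phrases():
        raise ValueError("grammar cannot supply that many distinct phrases")
    group = set()
    while len(group) < size:
        group.add(generate_phrase(rng))
    return sorted(group)


for phrase in generate_group(5, random.Random(0)):
    print(phrase)
```

The further constraint that each phrase use at least one word absent from the smaller sets could be added as an extra filter inside the loop.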


2.2 Digit Strings


The fifth component of the data base
listed in Table 1 was designed to focus
attention on the problem of recognizing
strings of connected digits. Template
training data was obtained by having each of the
20 speakers record three isolated word
repetitions of each digit zero through nine.
The order of presentation differed with each
repetition. The test data involves a set of
150 digit strings containing 40 three-digit
strings, 40 four-digit strings, 50 five-digit
strings, and 20 seven-digit strings. The
seven-digit strings were presented to the
speaker using the pattern XXX-XXXX, thereby
giving the appearance of a telephone number.
Thus, a total of 670 digits are present in
the 150 digit strings.


The list of digit strings has the
following additional properties:

1. Every digit (zero through nine)
   occurs 67 times.

2. Every digit occurs as the first
   digit 15 times.

3. Every digit occurs as the last
   digit 15 times.

4. In the 20 strings of seven digits,
   every digit occurs in the position
   immediately before the hyphen
   twice. Every digit occurs in the
   position immediately after the
   hyphen twice. Finally, the pairs of
   digits separated by the hyphen are
   all distinct.
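
The counts quoted for the digit string component are mutually consistent, as the short check below shows. The variable names are hypothetical; the numbers (40 three-digit, 40 four-digit, 50 five-digit and 20 seven-digit strings) are taken from the text.

```python
# String length -> number of strings of that length in the test list.
string_counts = {3: 40, 4: 40, 5: 50, 7: 20}

total_strings = sum(string_counts.values())
total_digits = sum(length * count for length, count in string_counts.items())

print(total_strings)        # 150 digit strings
print(total_digits)         # 670 digits in all
print(total_digits // 10)   # 67 occurrences of each of the ten digits
print(total_strings // 10)  # 15 first-position occurrences of each digit
print(string_counts[7] // 10)  # 2 occurrences of each digit beside the hyphen
```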


2.3 Alphabet/Word Spellings


The sixth component of the data base
shown in Table 1 has been designed to focus
attention on the problem of recognizing
letters contained in spelled words. Template
training data has been obtained from each of
the 20 speakers by recording five isolated
word repetitions of the alphabet. The order
in which the 26 letters were presented varied
with each repetition. To obtain test data,
each speaker was asked to spell 50 words in a
continuous speech fashion. The words were
selected from a 4000 word vocabulary with a
goal of maximizing the number of unique
letter pair combinations in the 50 word
subset. These 50 words are shown in Table 4.


Table 4

 1. absolution     18. halfway        35. rhythm
 2. barbarians     19. buckskin       36. submerge
 3. deployment     20. immovable      37. [illegible]
 4. expounding     21. ketchup        38. [illegible]
 5. geographer     22. whirlwind      39. anxiously
 6. microscope     23. attainment     40. armchair
 7. reassigned     24. suggested      41. [illegible]
 8. childbirth     25. cuddly         42. [illegible]
 9. falsehoods     26. mystique       43. breakdown
10. brushwork      27. queue          44. bulky
11. listlessly     28. thrifty        45. obtains
12. effectual      29. adversary      46. gunpowder
13. jackboots      30. chauvinism     47. relaxation
14. overview       31. dwarf          48. involve
15. avoidance      32. hypocrisy      49. knuckle
16. adjoining      33. oxygen         50. lodgings
17. [illegible]    34. punchbowl
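
The selection goal, maximizing the number of unique letter pair combinations, can be made concrete by counting the distinct pairs a word list covers. The sketch below assumes that a letter pair means two adjacent letters within a word, which the text does not state explicitly; the function names are hypothetical.

```python
from typing import Iterable, Set


def letter_pairs(word: str) -> Set[str]:
    """Distinct adjacent letter pairs in one word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def unique_pairs(words: Iterable[str]) -> Set[str]:
    """Distinct adjacent letter pairs covered by a list of words."""
    pairs: Set[str] = set()
    for word in words:
        pairs |= letter_pairs(word.lower())
    return pairs


print(sorted(letter_pairs("queue")))
print(len(unique_pairs(["queue", "bulky"])))
```

A word such as "queue" contributes only three distinct pairs because "ue" occurs twice, which is why the measure counts unique pairs across the whole list.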

2.4 Diagnostic Rhyme Test


The final component of the data base
involves a diagnostic rhyme test. The
vocabulary for this test was described by J. D.
Griffiths*. It consists of 250 words broken
into 50 five-word groups. Words within a
group differ only in a particular minimal
feature. In 25 of the groups, the
contrasting element is the final consonant. An
example group is: (dig, din, did, dim, dill). In
the remaining 25 groups, the contrasting
element is the initial consonant. An example
group is: (way, say, gay, they, nay). This
component of the data base is useful for
analyzing the strengths and weaknesses of a
speech recognition system.

* Griffiths, J. D., "Rhyming Minimal
Contrasts: A Simplified Diagnostic
Articulation Test", The Journal of the Acoustical
Society of America, Vol. 42, No. 1, pp.
236-241, July 1967.

A total of eight speakers recorded the
diagnostic rhyme test. Four of the speakers
recorded the test involving contrasting
initial consonants; the other four recorded the
test involving contrasting final consonants.
The 125 words associated with each half of
the test were presented to the speakers in
random order. Seven repetitions of the 125
word list were spoken by each speaker with
the order of presentation varying for each
repetition.


3. EQUIPMENT

The entire data base was recorded in the
speech research laboratory at the ITT Defense
Communications Division facility in San
Diego, California. This laboratory has been
carefully designed so that high-quality audio
recordings of human speech can be made. A
schematic diagram showing the portion of the
laboratory which was utilized to record this
continuous speech data base is given in
Figure 2.

The room labelled "recording room" in
the figure contains several features which
are conducive to the recording of speech in a
quiet atmosphere. The walls are double
studded, fiber glass filled, and extend from true
floor to true ceiling. A solid core door is
used and silencers have been employed in the
ventilation unit.

contact with the speaker at all times via the
double paned glass window between the rooms.
The operator's primary function was to guide
the speaker through the recording session by
controlling both the sequencing and pacing of
material displayed on the video monitor in
front of the speaker.


3.1 Recording Equipment

Speakers made their speech recordings
using a Shure Model SM10 professional
head-worn microphone. This device is light
weight, has a padded headband to minimize
user fatigue, and is designed for close talk
operation.

Recordings were made using a Technics
model RS-1500US tape deck. This unit was
placed within easy reach of the operator. All
recordings are monaural and were made at a
seven and one-half ips tape speed. The unit
has a [illegible] counter which shows one-half of
the actual [illegible] for tape speed.

All recordings were made using
one-quarter inch wide, 1200 feet long 3M Scotch
200 audio recording tape which is designed
for minimal print through. At seven and
one-half ips recording speed, approximately 30
minutes of recording time is possible on a
given track in one direction.

The speaker's microphone was connected
to a Shure Model M67 professional microphone
mixer located in the recording room. A cable
was then run from this mixer through the wall
and into a line input jack on the rear panel
of the recorder at the operator's station.
A  portion  of  the  rooa  adjacent  to  the  The  aierophona  alter  was  required  to  give 

recording  room  was  configured  into  an  opera-  adequate  gain  for  weak  speakers  and  to  add  a 

tor  work  station  using  aovable  partitions.  synchronisation  tona  at  tape  startup 

Froa  this  position,  the  operator  had  visual  (described  below). 


Figure 2. Schematic Diagram of Recording Facilities

3.2 Operator/Speaker Communication Equipment

Several pieces of apparatus facilitated operator/subject communication
and automated the display of the material to be recorded.

Two computer terminals were utilized. One terminal was positioned in the
recording room for displaying the words and phrases of each recording
session to the speaker. The speaker was seated approximately five feet
from the terminal. It was found that at distances less than three feet,
the electromagnetic radiation emitted from the terminal monitor was
picked up as an audible signal by the microphone. The size of the
characters in the displayed word or phrase was enlarged for this
terminal to minimize speaker eye strain.

The material displayed on the terminal in the recording room was
controlled by the operator using a second terminal in the adjacent room.
With this terminal, the operator was able to access all data base
software files and control the sequencing and pacing of material
displayed to the speaker.

To verbally communicate with the speaker, the operator used a
push-to-talk intercom. The operator was able to monitor the audio from
both the source (i.e. the speaker's microphone) and the tape using Koss
Pro/4AA headphones.


4. DATA BASE SOFTWARE

The entire content of the data base as summarized in Table 1 was
organized into a set of software text files. These files were stored on
a PDP-11/60 computer. With the file structure, the material to be spoken
by a given speaker could be defined as a specific sequence of these text
files. Moreover, this structure made it possible to intersperse training
and test utterances for each component of the data base being recorded
by the speaker.
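The idea of defining a speaker's material as a sequence of text files
can be illustrated with a short sketch. It is not the original PDP-11
software; the file names and the sample utterances are hypothetical and
serve only to show how training and test files can be interspersed.

```python
from typing import Dict, List, Tuple


def build_session(files: Dict[str, List[str]],
                  sequence: List[str]) -> List[Tuple[str, str]]:
    """Expand an ordered list of file names into (file, utterance) pairs."""
    session = []
    for name in sequence:
        for utterance in files[name]:
            session.append((name, utterance))
    return session


# Hypothetical file contents, for illustration only.
files = {
    "train_words": ["flight", "fare"],
    "test_phrases": ["what is the fare"],
}

# Training and test files may be interleaved in any order.
plan = build_session(files, ["train_words", "test_phrases", "train_words"])
for index, (name, utterance) in enumerate(plan):
    print(index, name, utterance)
```

Because a session is only an ordered list of file names, the same files
can be reused for different speakers in different orders.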

The training and test material for a given recording session was
presented to the speaker on the video terminal using prompting software
specially created for this project. In order to avoid a list reading
style, this software displayed one utterance (word or phrase) at a time
from the specified text file to the speaker. It also displayed the
utterance on the terminal at the operator's station, along with an index
number.

The prompting software enabled the operator via keyboard commands to
control the material spoken by the subject. After the speaker completed
his or her response to the displayed word or phrase, the operator could
(a) enter a "continue" command which would cause the prompting software
to display the next utterance in the text file, (b) enter a "repeat"
command to indicate that the speaker had corrected an error made in
uttering the word or phrase, or (c) enter a command to have a specified
utterance from the file displayed to the speaker. Option "c" was
primarily utilized by the operator to redisplay the word or phrase just
uttered by the speaker if an error had been made which was not
self-corrected by the speaker.
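The three operator commands can be sketched as a small state machine.
This is an illustration only, under the assumption that utterances are
addressed by a zero-based index; the class name, the command spellings,
and the log format are hypothetical and do not reproduce the original
prompting software.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Prompter:
    """Steps through one text file of utterances under operator control."""
    utterances: List[str]
    index: int = 0                      # index of the utterance on display
    log: List[Tuple[str, int]] = field(default_factory=list)

    def displayed(self) -> Optional[str]:
        """Utterance currently shown, or None once the file is exhausted."""
        if 0 <= self.index < len(self.utterances):
            return self.utterances[self.index]
        return None

    def command(self, name: str, target: Optional[int] = None) -> Optional[str]:
        if name == "continue":          # (a) display the next utterance
            self.index += 1
        elif name == "repeat":          # (b) speaker self-corrected; no move
            pass
        elif name == "display":         # (c) display a specified utterance
            if target is None or not 0 <= target < len(self.utterances):
                raise ValueError("display needs a valid utterance index")
            self.index = target
        else:
            raise ValueError(f"unknown command: {name}")
        self.log.append((name, self.index))
        return self.displayed()


p = Prompter(["zero", "one", "two"])
p.command("continue")       # moves on to "one"
p.command("repeat")         # "one" stays on display
p.command("display", 1)     # operator redisplays "one" for re-utterance
```

Every command is appended to a log, which mirrors the way repeats and
redisplays were kept in the recording history file described next.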

The prompting software had one additional feature. As the session
proceeded, the prompting software created a recording history file. This
file contained a record of each utterance spoken by the speaker, and the
relative time (in tape counter units) at which the speaker was prompted.
All repeat commands were also included in the history file along with
all line redisplays commanded by the operator. Each time the audio tape
was stopped during the recording session, the operator entered the
current counter reading from the tape recorder into the history file.
Then, upon restarting the tape, a tone sequence was generated by the
software and recorded on the tape prior to display of the first
utterance in the text file. This tone sequence is useful during tape
playback, audio tape editing, or digitization.

The relative prompt times were generated by the computer clock and were
initialized relative to the start-up of the tape recorder and
corresponding tone sequence. Thus, a position in the data immediately
preceding each spoken utterance can be accurately located.
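As an illustration of how the relative prompt times might be used after
digitization, the sketch below converts a prompt time into a sample
position measured from the tone sequence. The record layout, the use of
seconds as the time unit, and the sampling rate are assumptions made for
this example; the history file format itself is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    index: int             # utterance index shown to the operator
    text: str              # word or phrase that was prompted
    prompt_time_s: float   # time since the tone sequence (assumed seconds)


def prompt_sample(entry: HistoryEntry, tone_sample: int,
                  sample_rate_hz: int) -> int:
    """Sample index in the digitized audio at which the prompt appeared."""
    return tone_sample + round(entry.prompt_time_s * sample_rate_hz)


entry = HistoryEntry(index=1, text="zero one two", prompt_time_s=2.5)
start = prompt_sample(entry, tone_sample=1000, sample_rate_hz=8000)
print(start)
```

A search for the start of the utterance can then begin at this position
rather than at the start of the tape.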


5. DATA BASE COLLECTION

A typical recording session lasted between one and one-half and two
hours. Most speakers had very limited information regarding the nature
of the activity they were about to perform.

Prior to entering the recording room, each speaker completed a
questionnaire in which the following information was solicited: sex,
age, height, weight, place of birth, residences since birth (city,
state, country, dates resided therein), educational history (name of
school, location, dates, degree, major field), employment history
(occupation, dates), military experience, languages other than English
spoken fluently (including whether learned as a child, teenager, or
adult and the source of this knowledge). In addition, the questionnaire
asked the speaker if he or she had any unusual characteristics in their
conversational speech patterns and to describe any previous experience
they may have had as a subject in a speech recognition experiment. While
the speaker was filling out this questionnaire, the operator was
checking the recording equipment and conditions in the laboratory to
ensure that the configuration was proper for the upcoming recording
session.

Once the questionnaire had been completed, the speaker was escorted into
the recording room and seated. The operator then provided the speaker
with specific information regarding the recording session. Speaking from
a set of notes, the operator described the content of the session and
the mechanism which would be used to present material to the speaker for
recording. The operator advised the subjects to speak as naturally as
possible. The speakers were told that if they caught themselves saying
something incorrectly, they were to say the word "correction" and then
repeat the word or phrase being displayed. They were also advised that
the operator would redisplay the same word or phrase for re-utterance if
the operator felt the speaker had misspoken the word or phrase. The
operator also discussed any idiosyncrasies which might be present in the
content of a given component of the data base. For example, if the
subjects were recording the digit string component, they were cautioned
to say "zero" instead of "oh" if the symbol "0" appeared in the digit
string sequence. If the subjects were recording one of the diagnostic
rhyme components, they were told that a few of the words had commonly
used multiple pronunciations. Since only one of these pronunciations was
acceptable for this test, the subjects were advised that the
construction "sounds like ..." followed by a rhyming word would appear
to assist them in deducing the correct pronunciation. During this
briefing, the subjects were free to ask questions and make any other
remarks which would add to their understanding of the task at hand.

After completing this briefing, the operator placed the headset on the
speaker and properly positioned the Shure SM10 unidirectional
microphone. The speaker was cautioned about touching this apparatus or
making any head movements which might alter its position. Then, the
operator left the recording room and returned to the operator station in
the adjacent room of the laboratory.

Next, the operator performed a microphone test prior to initiating
recording. In this test, the speakers were shown a series of four
phrases and asked to say them using their natural speaking voice. The
operator adjusted the recording level for the given speaker by
monitoring the VU meter reading on the tape recorder. The level was
considered to be properly adjusted when the peaks of the recording level
were barely into the positive dB range and the average recording level
was roughly -3 to -5 dB. For some speakers, this microphone test had to
be repeated two or more times before the operator was satisfied with the
recording level calibration. Occasionally, the operator made additional
microphone tests during the session, if the initial calibration setting
became unacceptable due to a change in the level of a speaker's speech.
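The calibration criterion can be written down as a simple check. The
operator made this judgment by eye from the VU meter; the sketch below
only restates the rule, and the 1 dB ceiling used to express "barely
into the positive dB range" is an assumption of this example.

```python
def level_ok(peak_db: float, average_db: float,
             peak_margin_db: float = 1.0) -> bool:
    """True when the level matches the calibration rule in the text.

    peak_margin_db is an assumed upper bound for "barely positive" peaks.
    """
    peaks_barely_positive = 0.0 < peak_db <= peak_margin_db
    average_in_range = -5.0 <= average_db <= -3.0
    return peaks_barely_positive and average_in_range


print(level_ok(peak_db=0.5, average_db=-4.0))   # acceptable level
print(level_ok(peak_db=3.0, average_db=-4.0))   # peaks too high
```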

Once the microphone test was completed, the recorder was started and the
subject recorded a preamble message which identified the date of the
recording, the speaker by name, the speaker number, and the session
content. The recorder was then stopped long enough to remind the speaker
about the content and any idiosyncrasies of the first segment of speech
to be recorded. The session then proceeded with the operator controlling
the pace of the pre-defined sequence of material which the speaker
recorded. After each vocabulary repetition or test phrase section was
completed, the recorder was again briefly stopped to remind the speaker
about the content of the segment which followed.

A total of 50 speakers were recorded: 25 males and 25 females. In
response to contractual commitments, the vast majority of these
individuals had little or no previous experience as subjects in
experiments involving voice recording for speech recognition purposes.
The subject population was drawn from two sources. Roughly one-half of
the speakers were employees of the ITT Defense Communications Division
San Diego facility at the time the recordings were made. The other half
of the speaker population was affiliated with a temporary employment
agency located in the San Diego area.

Male subjects ranged in age from 21-51; the median age was 29. Female
subjects ranged in age from 18-56; the median age was also 29. Since the
San Diego population represents a melting pot of people from throughout
the United States, the subject population represents a broad mix of
native American dialects.

The 50 subjects were divided into four type classifications. Regardless
of the classification category, each speaker first recorded training and
test material from Airline Sets 1-2.

Then, depending upon the classification type, the speaker recorded
training and test utterances from other components of the data base as
identified in Table 1. Type A speakers recorded material from Airline
Sets 3-4; Type B speakers recorded material from the Diagnostic Rhyme
test involving contrasting initial consonants; Type C speakers recorded
material from the Diagnostic Rhyme test involving contrasting final
consonants; and Type D speakers recorded material from the digit string
and alphabet/word spelling component of the data base.

The vast majority of the speakers performed the Airline Set 1-2 task in
25 to 30 minutes of actual recording time. Each speaker was given a 20
minute break before beginning the second half of the task. The recording
time for the second session ranged from approximately 18 minutes to 37
minutes, depending upon the speakers and the classification type.


6. SUMMARY

This speech data base plays an integral part in the ITT Defense
Communications Division's continuing effort to develop efficient,
effective continuous speech recognition algorithms. The data is
restricted to speech from cooperative speakers recorded under quiet
conditions. Within the boundaries of this environment, however, the
scope of the data base is quite broad. Speech has been recorded from 50
speakers and involves training and test utterances from seven different
components. Four of those are representative of limited syntax
application areas for speech recognition. The others involve connected
digits, use of the alphabet in a continuous speech manner to spell
words, and a diagnostic rhyme component. Special care was taken to
produce quality recordings. The spoken material was displayed on a video
monitor. Prompting and pacing of this material was controlled by an
operator positioned in an adjacent room. Although the speaker population
was drawn from only two sources - ITT Defense Communications Division
employees and personnel from a temporary employment agency - a good
demographic mixture with respect to age, dialect, and physical size was
obtained. The recordings also contain typical non-speech sounds, such as
lip smacking, tongue clicking and breathing. In addition, speaking
styles varied from rapid and heavily coarticulated to slow and
deliberate. The data base thus provides a valuable tool for developing
and testing speech recognition systems.




INTERPRETATION GUIDE FOR EXPERIMENT SUMMARIES


"speaker nn" -  nn is speaker number followed by sex, age, and highest
                level of education.

"ALL WORDS"  -  recognition results for all vocabulary words.

"SEMANTIC"   -  recognition results excluding "of, for, the".

"CORRECT"    -  number of words or phrases correctly identified.

"OPTION n"   -  number of phrases for which the CSR system's nth
                candidate phrase was correct.

"ERRORS"     -  number of phrases for which none of the CSR system's
                candidate phrases were correct.
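The figures in each summary can be cross-checked with a few lines of
code. The phrase percentages are simply counts out of 50 trials. The
guide does not define WORD RATE; the formula below (correct words
divided by word trials plus insertions) is an inference that reproduces
the tabulated rates for the speakers checked, and should be read as an
assumption rather than as the documented scoring rule. The function
names are illustrative.

```python
from typing import List


def phrase_percent(count: int, trials: int = 50) -> float:
    """Percentage of phrase trials falling in one row of the summary."""
    return 100.0 * count / trials


def rows_consistent(counts: List[int], trials: int = 50) -> bool:
    """CORRECT, OPTION #2..#5 and ERRORS should account for every trial."""
    return sum(counts) == trials


def word_rate(correct: int, trials: int, insertions: int) -> float:
    """Assumed scoring rule: correct / (trials + insertions), in percent."""
    return 100.0 * correct / (trials + insertions)


# Speaker 02, ALL WORDS column.
print(phrase_percent(39))                        # CORRECT row
print(rows_consistent([39, 7, 0, 0, 1, 3]))      # six rows against 50 trials
print(round(word_rate(346, 358, 5), 1))          # WORD RATE
```

For speaker 02 this yields 78.0, True, and 95.3, matching the summary.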




EXPERIMENT SUMMARY
speaker 01: male, 29, high school

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   18   36.0%      38   76.0%
OPTION #2      =   12   24.0%       5   10.0%
OPTION #3      =    1    2.0%       1    2.0%
OPTION #4      =    1    2.0%       1    2.0%
OPTION #5      =    2    4.0%       0    0.0%
ERRORS         =   16   32.0%       5   10.0%

WORD TRIALS    =  353             286
CORRECT        =  308             273
INSERTIONS     =    0               1
DELETIONS      =   12               2
WORD RATE      =        87.3%           95.1%

EXPERIMENT SUMMARY
speaker 02: female, 28, high school

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   39   78.0%      46   92.0%
OPTION #2      =    7   14.0%       1    2.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =    3    6.0%       3    6.0%

WORD TRIALS    =  358             291
CORRECT        =  346             284
INSERTIONS     =    5               3
DELETIONS      =    1               1
WORD RATE      =        95.3%           96.6%

EXPERIMENT SUMMARY
speaker 03: female, 26, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   12   24.0%      27   54.0%
OPTION #2      =   17   34.0%       3    6.0%
OPTION #3      =    1    2.0%       2    4.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =   18   36.0%      17   34.0%

WORD TRIALS    =  392             313
CORRECT        =  339             280
INSERTIONS     =    2               0
DELETIONS      =    4               2
WORD RATE      =        86.0%           89.5%




EXPERIMENT SUMMARY
speaker 04: male, 22, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   30   60.0%      42   84.0%
OPTION #2      =    6   12.0%       4    8.0%
OPTION #3      =    2    4.0%       1    2.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =   11   22.0%       2    4.0%

WORD TRIALS    =  375             299
CORRECT        =  356             293
INSERTIONS     =    3               2
DELETIONS      =    9               0
WORD RATE      =        94.2%           97.3%


EXPERIMENT SUMMARY
speaker 05: male, 30, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      47   94.0%
OPTION #2      =    5   10.0%       0    0.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    1    2.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    7   14.0%       2    4.0%

WORD TRIALS    =  364             290
CORRECT        =  352             289
INSERTIONS     =    2               2
DELETIONS      =    7               0
WORD RATE      =        96.2%           99.0%


EXPERIMENT SUMMARY
speaker 06: female, 30, phd.

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   27   54.0%      39   78.0%
OPTION #2      =   10   20.0%       3    6.0%
OPTION #3      =    4    8.0%       2    4.0%
OPTION #4      =    2    4.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    7   14.0%       5   10.0%

WORD TRIALS    =  353             276
CORRECT        =  325             267
INSERTIONS     =    3               3
DELETIONS      =    6               0
WORD RATE      =        91.3%           95.0%




EXPERIMENT SUMMARY
speaker 07

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   40   80.0%      44   88.0%
OPTION #2      =    5   10.0%       3    6.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    4    8.0%       3    6.0%

WORD TRIALS    =  380             307
CORRECT        =  368             299
INSERTIONS     =    0               0
DELETIONS      =    3               1
WORD RATE      =        96.8%           97.4%


EXPERIMENT SUMMARY
speaker 08: male, 34, phd.

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   29   58.0%      41   82.0%
OPTION #2      =    8   16.0%       5   10.0%
OPTION #3      =    2    4.0%       1    2.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =    9   18.0%       2    4.0%

WORD TRIALS    =  388             311
CORRECT        =  373             304
INSERTIONS     =    9               2
DELETIONS      =    6               0
WORD RATE      =        94.0%           97.1%


EXPERIMENT SUMMARY
speaker 09: male, 30, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   27   54.0%      40   80.0%
OPTION #2      =   14   28.0%       7   14.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    3    6.0%       1    2.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =    4    8.0%       1    2.0%

WORD TRIALS    =  375             299
CORRECT        =  353             290
INSERTIONS     =    1               1
DELETIONS      =   10               0
WORD RATE      =        93.9%           96.7%




EXPERIMENT SUMMARY
speaker 10: female, 35, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   34   68.0%      43   86.0%
OPTION #2      =    8   16.0%       4    8.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    6   12.0%       3    6.0%

WORD TRIALS    =  361             288
CORRECT        =  342             278
INSERTIONS     =    2               0
DELETIONS      =   10               4
WORD RATE      =        94.2%           96.5%


EXPERIMENT SUMMARY
speaker 11: female, 23, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   35   70.0%      49   98.0%
OPTION #2      =    4    8.0%       0    0.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    2    4.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =    7   14.0%       1    2.0%

WORD TRIALS    =  353             278
CORRECT        =  333             276
INSERTIONS     =    1               0
DELETIONS      =    8               0
WORD RATE      =        94.1%           99.3%


EXPERIMENT SUMMARY
speaker 12: male, 32, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   33   66.0%      42   84.0%
OPTION #2      =    5   10.0%       3    6.0%
OPTION #3      =    2    4.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =    8   16.0%       4    8.0%

WORD TRIALS    =  378             306
CORRECT        =  352             295
INSERTIONS     =    2               2
DELETIONS      =    9               0
WORD RATE      =        92.6%           95.8%




EXPERIMENT SUMMARY
speaker 13: female, 43, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   40   80.0%      46   92.0%
OPTION #2      =    5   10.0%       3    6.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    4    8.0%       1    2.0%

WORD TRIALS    =  382             310
CORRECT        =  368             304
INSERTIONS     =    0               0
DELETIONS      =    7               1
WORD RATE      =        96.3%           98.1%


EXPERIMENT SUMMARY
speaker 14: male, 22, high school

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   26   52.0%      40   80.0%
OPTION #2      =    4    8.0%       4    8.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =   17   34.0%       6   12.0%

WORD TRIALS    =  381             307
CORRECT        =  348             291
INSERTIONS     =    2               1
DELETIONS      =   13               3
WORD RATE      =        90.9%           94.5%


EXPERIMENT SUMMARY
speaker 15: male, 29, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   32   64.0%      37   74.0%
OPTION #2      =    6   12.0%       6   12.0%
OPTION #3      =    4    8.0%       2    4.0%
OPTION #4      =    1    2.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    7   14.0%       4    8.0%

WORD TRIALS    =  359             286
CORRECT        =  334             270
INSERTIONS     =    2               2
DELETIONS      =    8               1
WORD RATE      =        92.5%           93.8%


EXPERIMENT SUMMARY
speaker 16: male, 48, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   19   38.0%      41   82.0%
OPTION #2      =    6   12.0%       1    2.0%
OPTION #3      =    3    6.0%       1    2.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =   21   42.0%       7   14.0%

WORD TRIALS    =  355             279
CORRECT        =  300             256
INSERTIONS     =    4               3
DELETIONS      =   14               3
WORD RATE      =        83.6%           90.8%


EXPERIMENT SUMMARY
speaker 17: male, 42, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      42   84.0%
OPTION #2      =    5   10.0%       4    8.0%
OPTION #3      =    2    4.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    5   10.0%       4    8.0%

WORD TRIALS    =  378             306
CORRECT        =  365             299
INSERTIONS     =    2               1
DELETIONS      =    3               0
WORD RATE      =        96.1%           97.4%


EXPERIMENT SUMMARY
speaker 18: male, 33, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   21   42.0%      35   70.0%
OPTION #2      =    7   14.0%       3    6.0%
OPTION #3      =    3    6.0%       2    4.0%
OPTION #4      =    2    4.0%       2    4.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =   17   34.0%       8   16.0%

WORD TRIALS    =  388             311
CORRECT        =  349             288
INSERTIONS     =   11               8
DELETIONS      =   12               4
WORD RATE      =        87.5%           90.9%




EXPERIMENT SUMMARY
speaker 19: male, 34, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   39   78.0%      44   88.0%
OPTION #2      =    9   18.0%       4    8.0%
OPTION #3      =    1    2.0%       1    2.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    1    2.0%       1    2.0%

WORD TRIALS    =  375             299
CORRECT        =  366             295
INSERTIONS     =    3               3
DELETIONS      =    3               0
WORD RATE      =        96.8%           97.7%


EXPERIMENT SUMMARY
speaker 20: male, 29, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      42   84.0%
OPTION #2      =    7   14.0%       5   10.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    5   10.0%       3    6.0%

WORD TRIALS    =  364             290
CORRECT        =  347             280
INSERTIONS     =    1               0
DELETIONS      =    5               2
WORD RATE      =        95.1%           96.6%


EXPERIMENT SUMMARY
speaker 21: male, 29, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      44   88.0%
OPTION #2      =    6   12.0%       4    8.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    2    4.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    4    8.0%       2    4.0%

WORD TRIALS    =  353             278
CORRECT        =  338             270
INSERTIONS     =    2               0
DELETIONS      =    5               0
WORD RATE      =        95.2%           97.1%


EXPERIMENT SUMMARY
speaker 22: male, 23, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   38   76.0%      41   82.0%
OPTION #2      =    6   12.0%       4    8.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    1    2.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    5   10.0%       4    8.0%

WORD TRIALS    =  378             306
CORRECT        =  366             298
INSERTIONS     =    1               2
DELETIONS      =    2               0
WORD RATE      =        96.6%           96.8%


EXPERIMENT SUMMARY
speaker 23: male, 33, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   26   52.0%      40   80.0%
OPTION #2      =    9   18.0%       4    8.0%
OPTION #3      =    3    6.0%       0    0.0%
OPTION #4      =    0    0.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =   12   24.0%       5   10.0%

WORD TRIALS    =  391             314
CORRECT        =  363             300
INSERTIONS     =    4               2
DELETIONS      =   13               3
WORD RATE      =        91.9%           94.9%


EXPERIMENT SUMMARY
speaker 24: female, 36, phd.

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   35   70.0%      43   86.0%
OPTION #2      =    7   14.0%       4    8.0%
OPTION #3      =    1    2.0%       1    2.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =    6   12.0%       2    4.0%

WORD TRIALS    =  380             302
CORRECT        =  362             294
INSERTIONS     =    1               1
DELETIONS      =    8               1
WORD RATE      =        95.0%           97.0%


EXPERIMENT SUMMARY
speaker 25: female, 43, high school

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   35   70.0%      41   82.0%
OPTION #2      =    7   14.0%       3    6.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       1    2.0%
ERRORS         =    8   16.0%       5   10.0%

WORD TRIALS    =  373             301
CORRECT        =  356             289
INSERTIONS     =    3               1
DELETIONS      =    6               1
WORD RATE      =        94.7%           95.7%


EXPERIMENT SUMMARY
speaker 26: male, 32, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   43   86.0%      47   94.0%
OPTION #2      =    2    4.0%       2    4.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    4    8.0%       1    2.0%

WORD TRIALS    =  352             277
CORRECT        =  344             273
INSERTIONS     =    0               0
DELETIONS      =    4               1
WORD RATE      =        97.7%           98.6%


EXPERIMENT SUMMARY
speaker 27: female, 28, masters

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   39   78.0%      45   90.0%
OPTION #2      =    6   12.0%       2    4.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    1    2.0%       1    2.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    3    6.0%       2    4.0%

WORD TRIALS    =  378             306
CORRECT        =  365             300
INSERTIONS     =    1               0
DELETIONS      =    5               1
WORD RATE      =        96.3%           98.0%




EXPERIMENT SUMMARY
speaker 28: female, 58, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   40   80.0%      47   94.0%
OPTION #2      =    5   10.0%       2    4.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    4    8.0%       1    2.0%

WORD TRIALS    =  342             277
CORRECT        =  333             274
INSERTIONS     =    1               0
DELETIONS      =    1               0
WORD RATE      =        97.1%           98.9%


EXPERIMENT SUMMARY
speaker 29: male, 34, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   33   66.0%      43   86.0%
OPTION #2      =    7   14.0%       4    8.0%
OPTION #3      =    2    4.0%       1    2.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       1    2.0%
ERRORS         =    6   12.0%       1    2.0%

WORD TRIALS    =  375             299
CORRECT        =  355             292
INSERTIONS     =    0               0
DELETIONS      =    9               0
WORD RATE      =        94.7%           97.7%


EXPERIMENT SUMMARY
speaker 30: male, 26, bachelors

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   25   50.0%      39   78.0%
OPTION #2      =    6   12.0%       5   10.0%
OPTION #3      =    1    2.0%       2    4.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =   16   32.0%       4    8.0%

WORD TRIALS    =  364             290
CORRECT        =  338             279
INSERTIONS     =    7               1
DELETIONS      =   10               1
WORD RATE      =        91.1%           95.9%


EXPERIMENT SUMMARY
speaker 31: male, 27, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      44   88.0%
OPTION #2      =   10   20.0%       5   10.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    1    2.0%       1    2.0%

WORD TRIALS    =  357             291
CORRECT        =  348             288
INSERTIONS     =    4               3
DELETIONS      =    3               0
WORD RATE      =        96.4%           98.0%


EXPERIMENT SUMMARY
speaker 32: female, 24, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   39   78.0%      47   94.0%
OPTION #2      =    6   12.0%       1    2.0%
OPTION #3      =    1    2.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    1    2.0%       0    0.0%
ERRORS         =    3    6.0%       2    4.0%

WORD TRIALS    =  377             303
CORRECT        =  364             300
INSERTIONS     =    0               0
DELETIONS      =    7               0
WORD RATE      =        96.6%           99.0%


EXPERIMENT SUMMARY
speaker 33: female, 51, junior college

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   37   74.0%      45   90.0%
OPTION #2      =    5   10.0%       2    4.0%
OPTION #3      =    2    4.0%       2    4.0%
OPTION #4      =    1    2.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    5   10.0%       1    2.0%

WORD TRIALS    =  375             299
CORRECT        =  364             295
INSERTIONS     =    2               1
DELETIONS      =    5               0
WORD RATE      =        96.6%           98.3%




EXPERIMENT SUMMARY
speaker 34: female, 48, high school

                   ALL WORDS       SEMANTIC
PHRASE TRIALS  =   50
CORRECT        =   48   96.0%      50  100.0%
OPTION #2      =    2    4.0%       0    0.0%
OPTION #3      =    0    0.0%       0    0.0%
OPTION #4      =    0    0.0%       0    0.0%
OPTION #5      =    0    0.0%       0    0.0%
ERRORS         =    0    0.0%       0    0.0%

WORD TRIALS    =  358             291
CORRECT        =  356             291
INSERTIONS     =    0               0
DELETIONS      =    1               0
WORD RATE      =        99.4%          100.0%

EXPERIMENT  SUMMARY 

speaker  35:  female,  24,  junior  college 

ALL  WORDS  SEMANTIC

PHRASE  TRIALS  =  50

CORRECT  =  33  66.0%  38  76.0%

OPTION  #2  =  5  10.0%  2  4.0%

OPTION  #3  =  3  6.0%  2  4.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  1  2.0%  1  2.0%

ERRORS  =  7  14.0%  7  14.0%

WORD  TRIALS  =  375  303

CORRECT  =  357  289

INSERTIONS  =  5  3

DELETIONS  =  3  1

WORD  RATE  =  93.9%  94.4%

EXPERIMENT  SUMMARY 

speaker  36:  female,  29,  bachelors 

ALL  WORDS  SEMANTIC

PHRASE  TRIALS  =  50

CORRECT  =  34  68.0%  45  90.0%

OPTION  #2  =  9  18.0%  3  6.0%

OPTION  #3  =  2  4.0%  1  2.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  4  8.0%  1  2.0%

WORD  TRIALS  =  340  279

CORRECT  =  325  274

INSERTIONS  =  1  0

DELETIONS  =  3  0

WORD  RATE  =  95.3%  98.2%


EXPERIMENT  SUMMARY 
speaker  37:  female,  28,  masters 

ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  16  32.0%  23  46.0%

OPTION  #2  =  6  12.0%  3  6.0%

OPTION  #3  =  2  4.0%  1  2.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  1  2.0%  1  2.0%

ERRORS  =  25  50.0%  22  44.0%

WORD  TRIALS  =  368  307 

CORRECT  =  294  252 

INSERTIONS  =  5  1 

DELETIONS  =  24  17 

WORD  RATE  =  78.8%  81.8%


EXPERIMENT  SUMMARY 
speaker  38:  male,  24,  bachelors 

ALL  WORDS  SEMANTIC 

PHRASE  TRIALS  =  50 

CORRECT  =  26  52.0%  39  78.0%

OPTION  #2  =  9  18.0%  4  8.0%

OPTION  #3  =  0  0.0%  0  0.0%

OPTION  #4  =  1  2.0%  0  0.0% 

OPTION  #5  =  1  2.0%  1  2.0% 

ERRORS  =  13  26.0%  6  12.0% 

WORD  TRIALS  =  367  305 

CORRECT  =  338  292 

INSERTIONS  =  5  3

DELETIONS  =  14  2

WORD  RATE  =  90.9%  94.8% 


EXPERIMENT  SUMMARY 
speaker  39:  female,  28,  junior  college 
ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  40  80.0%  44  88.0%

OPTION  #2  =  4  8.0%  1  2.0%

OPTION  #3  =  4  8.0%  3  6.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  2  4.0%  2  4.0%

WORD  TRIALS  =  367  307

CORRECT  =  355  299 

INSERTIONS  =  0  0

DELETIONS  =  2  0

WORD  RATE  =  96.7%  97.4% 




EXPERIMENT  SUMMARY
speaker  40:  [illegible]

ALL  WORDS  SEMANTIC

PHRASE  TRIALS  =  50

CORRECT  =  36  72.0%  43  86.0%

OPTION  #2  =  3  6.0%  5  10.0%

OPTION  #3  =  0  0.0%  0  0.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  11  22.0%  2  4.0%

WORD  TRIALS  =  379  307

CORRECT  =  363  300

INSERTIONS  =  0  0

DELETIONS  =  8  0

WORD  RATE  =  95[illegible]  97.7%


EXPERIMENT  SUMMARY
speaker  41:  female,  [illegible]  school

ALL  WORDS  SEMANTIC

PHRASE  TRIALS  =  50

CORRECT  =  34  68.0%  41  82.0%

OPTION  #2  =  2  4.0%  2  4.0%

OPTION  #3  =  1  2.0%  2  4.0%

OPTION  #4  =  2  4.0%  1  2.0%

OPTION  #5  =  1  2.0%  0  0.0%

ERRORS  =  10  20.0%  4  8.0%

WORD  TRIALS  =  357  288

CORRECT  =  338  277

INSERTIONS  =  3  1

DELETIONS  =  3  1

WORD  RATE  =  93.9%  95.8%


EXPERIMENT  SUMMARY
speaker  42:  [illegible]

ALL  WORDS  SEMANTIC

PHRASE  TRIALS  =  50

CORRECT  =  29  58.0%  39  78.0%

OPTION  #2  =  2  4.0%  2  4.0%

OPTION  #3  =  5  10.0%  3  6.0%

OPTION  #4  =  1  2.0%  1  2.0%

OPTION  #5  =  2  4.0%  0  0.0%

ERRORS  =  11  22.0%  5  10.0%

WORD  TRIALS  =  374  302

CORRECT  =  352  290

INSERTIONS  =  11  6

DELETIONS  =  4  0

WORD  RATE  =  91.4%  94.2%


EXPERIMENT  SUMMARY


speaker  43:  male,  21,  junior  college

ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  32  64.0%  40  80.0%

OPTION  #2  =  4  8.0%  4  8.0%

OPTION  #3  =  1  2.0%  0  0.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  0  0.0%  1  2.0%

ERRORS  =  12  24.0%  5  10.0%

WORD  TRIALS  =  386  313

CORRECT  =  362  301

INSERTIONS  =  2  1

DELETIONS  =  13  1

WORD  RATE  =  93.3%  95.9%

EXPERIMENT  SUMMARY 
speaker  44:  female,  36,  junior  college 
ALL  WORDS  SEMANTIC 
PHRASE  TRIALS  =  50 

CORRECT  =  37  74.0%  41  82.0%

OPTION  #2  =  4  8.0%  1  2.0%

OPTION  #3  =  3  6.0%  2  4.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  1  2.0%  1  2.0% 

ERRORS  =  5  10.0%  5  10.0% 

WORD  TRIALS  =  386  314 

CORRECT  =  372  306 

INSERTIONS  =  0  1 

DELETIONS  =  1  0 

WORD  RATE  =  96.4%  97.1% 


EXPERIMENT  SUMMARY 
speaker  45:  female,  16,  high  school

ALL  WORDS  SEMANTIC 
PHRASE  TRIALS  =  50 

CORRECT  =  25  50.0%  41  82.0% 

OPTION  #2  =  5  10.0%  3  6.0%

OPTION  #3  =  0  0.0%  0  0.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  20  40.0%  6  12.0% 

WORD  TRIALS  =  343  281 

CORRECT  =  309  266 

INSERTIONS  =  8  3

DELETIONS  =  13  2

WORD  RATE  =  88.0%  93.7% 


EXPERIMENT  SUMMARY 
speaker  46:  female,  29,  junior  college 
ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  34  68.0%  46  92.0%

OPTION  #2  =  3  6.0%  1  2.0%

OPTION  #3  =  5  10.0%  1  2.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  2  4.0%  0  0.0%

ERRORS  =  5  10.0%  2  4.0%

WORD  TRIALS  =  368  307 

CORRECT  =  351  304 

INSERTIONS  =  1  1 

DELETIONS  =  2  0

WORD  RATE  =  95.1%  98.7% 


EXPERIMENT  SUMMARY 
speaker  47:  female,  35,  bachelors 

ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  39  78.0%  46  92.0%

OPTION  #2  =  8  16.0%  3  6.0%

OPTION  #3  =  1  2.0%  0  0.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  2  4.0%  1  2.0%

WORD  TRIALS  =  368  307 

CORRECT  =  356  302 

INSERTIONS  =  1  0 

DELETIONS  =  5  0

WORD  RATE  =  96.5%  98.4%


EXPERIMENT  SUMMARY 
speaker  48:  female,  36,  junior  college 
ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  21  42.0%  25  50.0%

OPTION  #2  =  6  12.0%  3  6.0%

OPTION  #3  =  0  0.0%  1  2.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  22  44.0%  21  42.0%

WORD  TRIALS  =  369  306 

CORRECT  =  313  262 

INSERTIONS  =  10  7

DELETIONS  =  7  3

WORD  RATE  =  82.6%  83.7%


EXPERIMENT  SUMMARY 

speaker  49:  male,  24,  junior  college 

ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  32  64.0%  46  92.0%

OPTION  #2  =  2  4.0%  0  0.0%

OPTION  #3  =  1  2.0%  0  0.0%

OPTION  #4  =  1  2.0%  0  0.0%

OPTION  #5  =  1  2.0%  0  0.0%

ERRORS  =  13  26.0%  4  8.0%

WORD  TRIALS  =  368  307

CORRECT  =  342  297 

INSERTIONS  =  0  0

DELETIONS  =  16  2 

WORD  RATE  =  92.9%  96.7% 


EXPERIMENT  SUMMARY 

speaker  50:  male,  21,  junior  college 

ALL  WORDS  SEMANTIC 


PHRASE  TRIALS  =  50

CORRECT  =  24  48.0%  37  74.0%

OPTION  #2  =  4  8.0%  2  4.0%

OPTION  #3  =  1  2.0%  0  0.0%

OPTION  #4  =  0  0.0%  0  0.0%

OPTION  #5  =  0  0.0%  0  0.0%

ERRORS  =  21  42.0%  11  22.0%

WORD  TRIALS  =  338  272

CORRECT  =  201  242

INSERTIONS  =  11  7

DELETIONS  =  13  0

WORD  RATE  =  81.0%  86.7%



